A data engineer needs to perform a one-time, ad-hoc query
Nikee-TutorialsDojo updated 1 month ago
A data engineer needs to perform a one-time, ad-hoc query to retrieve specific columns from a large dataset stored in an Amazon S3 bucket. The dataset contains 25 columns in Apache Parquet format. The solution should be efficient and avoid introducing any operational overhead or the need for managing new infrastructure.
The given correct answer is: “Utilize Amazon Athena to perform SQL SELECT queries to fetch specific columns.”
I agree with that, but either the question or the answer is missing context. To query the data in Athena, you first need to add it to the Glue Data Catalog, by running a crawler, through Glue directly, or in any other available way. The data is not simply there for you to query just because it's in a bucket.
Hi drsparrow,
Thank you for pointing this out.
The scenario as written leaves out an important detail. Athena doesn’t automatically know about the data stored in S3. To query it, the dataset must first be registered as a table in the AWS Glue Data Catalog. You can accomplish this by running a Glue crawler, manually creating a table, or using other Glue-supported methods. Once the table is defined in the catalog, Athena can query just the columns you need, and because Parquet is a columnar format it scans only those columns, all without you managing any infrastructure.
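As a rough illustration of the "manually creating a table" route, here is a minimal Python sketch that builds the two statements involved: an Athena `CREATE EXTERNAL TABLE` DDL that registers the Parquet files in the catalog, and a `SELECT` that reads only the chosen columns. The database, table, column, and bucket names are all hypothetical and not taken from the exam item. The helper `submit_query` assumes boto3 is installed and AWS credentials are configured, and it is not called here. Identifiers are assumed to be simple lowercase names that need no quoting.

```python
def build_create_table_ddl(database, table, columns, s3_location):
    """Build Athena DDL that registers Parquet files in S3 as an external table.

    `columns` is a list of (name, type) pairs. All names are hypothetical.
    """
    column_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_defs}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


def build_select_query(database, table, columns):
    """Build a SELECT that names only the columns that are needed."""
    return f"SELECT {', '.join(columns)} FROM {database}.{table}"


def submit_query(sql, database, output_location):
    """Submit a statement to Athena and return its query execution ID.

    Assumes boto3 is installed and credentials/region are configured.
    `output_location` is an S3 URI where Athena writes query results.
    """
    import boto3  # imported lazily so the builders above work without it

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical example values: only 2 of the dataset's columns are declared here.
ddl = build_create_table_ddl(
    "sales_db",
    "orders",
    [("order_id", "bigint"), ("order_total", "double")],
    "s3://example-bucket/orders/",
)
query = build_select_query("sales_db", "orders", ["order_id", "order_total"])
print(ddl)
print(query)
```

In practice the DDL would be submitted once (for example with `submit_query(ddl, "sales_db", "s3://example-bucket/athena-results/")`) before the `SELECT` is run. A Glue crawler reaches the same end state by inferring the schema instead of having it written out by hand.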
We’ll update this item to better reflect the actual workflow. Thanks again for highlighting this.
Regards,
Nikee @ Tutorials Dojo