  • toti

    Member
    June 26, 2025 at 7:14 am

    – Step 1: Filter out irrelevant columns, focusing only on the comments, likes, time, and date.

    – Step 2: Compress the data to reduce storage requirements and improve processing time.

    – Step 3: Transform the CSV data into a columnar storage format for faster access.

    The correct answer above is for a question from timed mode test 1. My issue is this: how can you say “filter out irrelevant columns” while the dataset is still in CSV? According to the correct answer, it won’t be converted into a columnar format until Step 3.

    Could you explain this, please?

  • Irene-TutorialsDojo

    Administrator
    June 26, 2025 at 12:45 pm

    Hi Toti,

    Thank you for your question and for bringing this to our attention.

    We understand your concern about filtering columns while the data is still in CSV format. Filtering can be done on CSV files because it involves selecting specific columns, such as reviews, ratings, and timestamps, based on the dataset’s structure. AWS Glue can read CSV files, identify their schema, and filter out unnecessary columns without requiring a columnar format. Doing this first reduces the data size, making the following steps more efficient.
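
    To illustrate the point that column selection does not depend on a columnar format, here is a minimal sketch using Python's standard `csv` module rather than AWS Glue itself. The sample rows, the column names, and the `filter_columns` helper are assumptions made up for this example; they are not taken from the exam question.

```python
import csv
import io

# Hypothetical sample data; column names are assumptions for illustration only.
RAW_CSV = """user_id,comment,likes,time,date,device
1,Great product,12,09:15,2025-06-01,mobile
2,Not for me,3,14:40,2025-06-02,desktop
"""

# Columns to keep, mirroring "comments, likes, time, and date" from the question.
KEEP = ["comment", "likes", "time", "date"]


def filter_columns(csv_text, keep):
    """Return CSV text that contains only the columns named in `keep`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    # extrasaction="ignore" drops any field that is not listed in `keep`.
    writer = csv.DictWriter(
        out, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


filtered = filter_columns(RAW_CSV, KEEP)
print(filtered)
```

    The data stays in CSV throughout: the reader parses each row by header name and the writer simply omits the unwanted fields, so the output is a smaller CSV ready for the next step.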

    The correct order is to filter columns first, then convert the CSV to a columnar format like Parquet, and finally compress the data. Filtering reduces the data processed during conversion to a format designed for faster queries. Compression then lowers storage needs and improves processing speed. We’ll update the portal to clarify this. If you have other queries, please don’t hesitate to reach out.
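
    The ordering described above (filter, then convert to a columnar layout, then compress) can be sketched with the Python standard library alone. This is only a stand-in: a dictionary of column lists serialized as JSON plays the role of a columnar format such as Parquet, and `zlib` plays the role of the compression codec. The sample data and the helper names `to_columns`, `compress_columns`, and `decompress_columns` are hypothetical.

```python
import csv
import io
import json
import zlib

# Hypothetical data set: 200 rows with two columns we do not need.
HEADER = "user_id,comment,likes,time,date,device\n"
ROWS = "".join(
    f"{i},Comment number {i},{i % 50},09:{i % 60:02d},2025-06-01,mobile\n"
    for i in range(200)
)
RAW_CSV = HEADER + ROWS

KEEP = ["comment", "likes", "time", "date"]


def to_columns(csv_text, keep):
    """Steps 1 and 2: keep only the wanted fields and pivot rows into columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in keep}
    for row in reader:
        for name in keep:
            columns[name].append(row[name])
    return columns


def compress_columns(columns):
    """Step 3: serialize the column-oriented data and compress the result."""
    payload = json.dumps(columns).encode("utf-8")
    return zlib.compress(payload)


def decompress_columns(blob):
    """Reverse step 3 so the columns can be read back."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


columns = to_columns(RAW_CSV, KEEP)
blob = compress_columns(columns)
print(len(RAW_CSV.encode("utf-8")), "bytes raw ->", len(blob), "bytes stored")
```

    Because the unwanted columns are dropped before the pivot, the later steps never touch them. Real columnar formats typically apply compression per column inside the file rather than over one whole payload, so treat this only as an illustration of the step order.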

    Best,

    Irene @ Tutorials Dojo
