Home › Forums › AWS › AWS Certified Machine Learning Engineer Associate MLA-C01 › Transform first or compress?

  • Transform first or compress?

  • Mohiddin Shaik

    Member
    March 19, 2026 at 9:01 am

A media company is developing a recommendation engine to boost user engagement by analyzing interactions. They collect and store data like user reviews, ratings, timestamps, and profiles (age, preferences, viewing history). This data is currently stored in CSV (comma-separated values) format. The company also tracks the names of the content creators, which are stored alongside the interaction data.

    Select and order the steps a data scientist could take to optimize the data format and improve processing performance. Each step should be selected one time or not at all. (Select and order THREE.)

    Filter out irrelevant columns, focusing only on the reviews, ratings, and timestamps.

    Convert the data from CSV format to a hierarchical format such as JSON.

    Compress the data to reduce storage requirements and improve processing time.

    Keep the data in the existing CSV format and process all columns as-is.

    Aggregate the data by content creators’ names.

    Transform the CSV data into a columnar storage format for faster access.

– Step 1: Filter out irrelevant columns, focusing only on the reviews, ratings, and timestamps.

    – Step 2: Compress the data to reduce storage requirements and improve processing time.

    – Step 3: Transform the CSV data into a columnar storage format for faster access.

Many sources say that Step 2 should be transform and Step 3 should be compress. Could you please provide a more detailed explanation of which order is best and most efficient?

  • Irene-TutorialsDojo

    Administrator
    March 19, 2026 at 12:40 pm

    Hi Mohiddin,

    Thank you for reaching out and for the great question! The correct answer is Step 1: Filter → Step 2: Transform (to columnar/Parquet) → Step 3: Compress.

The key reason is that compression works best after transformation into a columnar format like Parquet. Columnar storage formats compress column by column, with the codec applied per column chunk and suited to each column’s data type. This yields better compression ratios and reduces both storage footprint and I/O during query processing. Compressing the CSV first (before transforming) applies a single generic codec to row-based data, which produces a file that typically cannot be split for parallel processing and compresses significantly less effectively.

    AWS Prescriptive Guidance explicitly states: “When authoring ETL jobs, we recommend outputting transformed data in a column-based data format. Columnar data formats, such as Apache Parquet and ORC, are designed to minimize data movement and maximize compression. Compressing data also helps reduce the amount of data stored, and it improves read/write operation performance.”

    Notice the deliberate order: transform to columnar first, then compress. The answer key in our practice exam reflects this AWS best practice, and we appreciate your diligence in verifying it! If you have further questions, don’t hesitate to ask.

    Cheers,

    Irene @ Tutorials Dojo
