Hi Mohiddin,
Thank you for reaching out and for the great question! The correct answer is Step 1: Filter → Step 2: Transform (to columnar/Parquet) → Step 3: Compress.
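The key reason is that compression works best after the data has been transformed into a columnar format like Parquet. Columnar formats store and compress each column separately, so the encoding and compression can be chosen to suit each column’s data type, and values of the same kind sit next to each other. This saves storage space and reduces disk I/O during query processing. Compressing the CSV first (before transforming) applies a single generic codec to row-based data in which values of different types are interleaved, which typically yields a worse compression ratio. With a non-splittable codec such as gzip, the compressed CSV also cannot be split for parallel reads, and the job would have to decompress it again before transforming it anyway.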
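To make the ordering concrete, here is a minimal sketch using only the Python standard library. It is a toy illustration, not Parquet itself: the rows, column names, and the hand-rolled delta and dictionary encodings are all made up for this example, and `zlib` stands in for whatever codec a real job would use. It filters first, then compares compressing the rows as CSV against compressing a column-by-column layout.

```python
import random
import zlib

# Synthetic, made-up data: (order_id, status, amount_cents).
rng = random.Random(42)
rows = [
    (1000 + i, rng.choice(["ok", "failed", "pending"]), rng.choice([499, 999, 1999]))
    for i in range(5000)
]

# Step 1: Filter. Drop rows we do not need before doing any other work.
kept = [r for r in rows if r[1] != "failed"]

# Row-based path: compress the CSV text directly with one generic codec.
csv_bytes = "\n".join(f"{i},{s},{a}" for i, s, a in kept).encode()
row_size = len(zlib.compress(csv_bytes, 9))

# Step 2: Transform. Pivot the rows into columns and pick an encoding
# per column (a hand-rolled stand-in for what columnar formats do).
ids, statuses, amounts = zip(*kept)

# Increasing integers: store the first value, then the gaps between values.
deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
id_col = "\n".join(map(str, deltas)).encode()

# Low-cardinality columns: store a small dictionary plus one-byte indexes.
status_values = sorted(set(statuses))
status_col = bytes(status_values.index(s) for s in statuses)
amount_values = sorted(set(amounts))
amount_col = bytes(amount_values.index(a) for a in amounts)
dictionaries = (",".join(status_values) + "|" + ",".join(map(str, amount_values))).encode()

# Step 3: Compress. Each column is compressed on its own.
col_size = len(dictionaries) + sum(
    len(zlib.compress(col, 9)) for col in (id_col, status_col, amount_col)
)

print(f"rows kept: {len(kept)} of {len(rows)}")
print(f"compressed CSV (row-based): {row_size} bytes")
print(f"compressed columns:         {col_size} bytes")
```

On this synthetic data the column-wise layout should come out noticeably smaller than the compressed CSV, because each column holds values of a single kind. Real Parquet writers apply their own encodings and codecs, so exact ratios on your data will differ.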
AWS Prescriptive Guidance explicitly states: “When authoring ETL jobs, we recommend outputting transformed data in a column-based data format. Columnar data formats, such as Apache Parquet and ORC, are designed to minimize data movement and maximize compression. Compressing data also helps reduce the amount of data stored, and it improves read/write operation performance.”
Notice the deliberate order: transform to columnar first, then compress. The answer key in our practice exam reflects this AWS best practice, and we appreciate your diligence in verifying it! If you have further questions, don’t hesitate to ask.
Cheers,
Irene @ Tutorials Dojo