Ends in
00
days
00
hrs
00
mins
00
secs
SHOP NOW

🎊 Black Friday is here! Enjoy up to 30% OFF ALL AWS Associate-Level Courses!

Find answers, ask questions, and connect with our
community around the world.

Home Forums AWS AWS Certified Data Engineer Associate DEA-C01 Test mode set 1 – Redshift vs RDS read replicas

  • Test mode set 1 – Redshift vs RDS read replicas

  • paul10101

    Member
    October 17, 2025 at 7:19 am

    question on

    “A game retail company manages user purchase data in a MySQL database on Amazon RDS. They frequently query and analyze the most recent three months of data, amounting to several terabytes. The company also needs to store older data outside the database, yet access it for quarterly trend reports, which involves joining it with recent data.

    Which of the following options will provide optimal performance and a cost-effective solution based on the requirements?”

    Correct Answer: “Migrate historical data to Amazon S3. Set up a daily transfer of current data from Amazon RDS to Amazon Redshift. Use Redshift for regular queries and Redshift Spectrum to join the historical data in S3 with current data.”

    Incorrect answer: Synchronize a year’s worth of data on an Amazon RDS read replica. Move the older historical data to an Amazon S3 bucket for long-term storage. Create an AWS Glue Data Catalog and use Amazon Athena to join the historical and current data for report generation.

    Please help me understand why. The guide says of the incorrect answer “This approach is inefficient as it requires maintaining a full year’s worth of data on an RDS read replica, which can be costly and doesn’t leverage the optimized analytical capabilities of a dedicated analytics service like Amazon Redshift. Additionally, while Amazon Athena is effective for querying S3 data, it necessitates data movement between RDS and S3, which can be time-consuming and adds complexity.”
    BUT BOTH approaches require moving data from RDS to S3. Further, the answer labeled “Correct” requires additional overhead by adding Redshift and moving data from RDS to Redshift. This is also much more costly and less efficient than using the built in capability of RDS Read Replicas. So i don’t understand why using Redshift is better between the 2 options.

  • Nikee-TutorialsDojo

    Administrator
    October 17, 2025 at 9:25 am

    Hi Paul,

    Thank you for your question. Let me clarify why the Redshift + Redshift Spectrum approach is considered the optimal solution compared to using an RDS read replica + Athena.

    First, the query performance on recent (hot) data.

    The company frequently queries the last three months of data, which amounts to several terabytes. While an RDS read replica can technically handle some analytical queries, MySQL (OLTP) engines are not optimized for large-scale aggregations, joins, or trend analysis. Running queries on several terabytes of data in RDS can be slow and resource-intensive, especially when the queries need to join with historical data.

    In contrast, Amazon Redshift is a columnar, massively parallel processing (MPP) database designed specifically for analytics. By storing recent data in Redshift, the company can execute complex, frequent queries efficiently, delivering faster insights for reporting and sales analysis.

    Second, the cost and storage optimization.

    Maintaining a full RDS read replica for a year’s worth of data is expensive because it duplicates the primary database in storage and compute costs. The read replica still incurs continuous costs even if the historical data is rarely accessed.

    The Redshift + Redshift Spectrum approach separates hot and cold data effectively:

    • Hot data (for the past 3 months) has been stored in Redshift for fast query performance.

    • Cold data (older historical data) is stored in S3, which is much cheaper to maintain.

    With Redshift Spectrum, you can query historical data in S3 directly and join it with Redshift data on demand, without loading all historical data into Redshift. This minimizes storage and compute costs while maintaining query efficiency.

    Lastly, the efficiency of analytics workflows.

    Athena can query S3 directly, but joining Athena queries with RDS data is complex and slower. It requires either:

    • Moving data from RDS to S3 frequently, or

    • Using federated queries, which are less efficient for multi-terabyte datasets.

    Redshift + Redshift Spectrum simplifies this workflow by:

    • Keeping frequent queries on high-performance Redshift storage

    • Querying historical data directly from S3 without full data loads

    • Reducing operational overhead while maintaining high performance and reproducibility

    Although both approaches involve moving historical data to S3, the critical difference is how recent (hot) data is handled. The Redshift approach:

    • Provides fast, high-performance queries on large, frequently accessed datasets

    • Separates hot and cold data, optimizing storage costs

    • Supports scalable analytics across terabytes of data, including joins between recent and historical data

    While using an RDS read replica may appear more straightforward and leverage existing OLTP infrastructure, it is slower, less scalable, and more expensive for analytics workloads at this scale. The Redshift + Redshift Spectrum solution balances performance, cost, and scalability, making it the optimal choice.

    I hope this clears up your confusion. Please don’t hesitate to ask if you have any additional thoughts or need clarification.

    Regards,

    Nikee @ Tutorials Dojo

Viewing 1 - 2 of 2 replies

Log in to reply.

Original Post
0 of 0 posts June 2018
Now
Skip to content