Home › Forums › AWS › AWS Certified Data Engineer Associate DEA-C01 › Test mode set 1 – Redshift vs RDS read replicas › Reply To: Test mode set 1 – Redshift vs RDS read replicas
-
Hi Paul,
Thank you for your question. Let me clarify why the Redshift + Redshift Spectrum approach is considered the optimal solution compared to using an RDS read replica + Athena.
First, the query performance on recent (hot) data.
The company frequently queries the last three months of data, which amounts to several terabytes. While an RDS read replica can technically handle some analytical queries, MySQL (OLTP) engines are not optimized for large-scale aggregations, joins, or trend analysis. Running queries on several terabytes of data in RDS can be slow and resource-intensive, especially when the queries need to join with historical data.
In contrast, Amazon Redshift is a columnar, massively parallel processing (MPP) database designed specifically for analytics. By storing recent data in Redshift, the company can execute complex, frequent queries efficiently, delivering faster insights for reporting and sales analysis.
Second, the cost and storage optimization.
Maintaining a full RDS read replica for a year’s worth of data is expensive because it duplicates the primary database in storage and compute costs. The read replica still incurs continuous costs even if the historical data is rarely accessed.
The Redshift + Redshift Spectrum approach separates hot and cold data effectively:
-
Hot data (for the past 3 months) has been stored in Redshift for fast query performance.
-
Cold data (older historical data) is stored in S3, which is much cheaper to maintain.
With Redshift Spectrum, you can query historical data in S3 directly and join it with Redshift data on demand, without loading all historical data into Redshift. This minimizes storage and compute costs while maintaining query efficiency.
Lastly, the efficiency of analytics workflows.
Athena can query S3 directly, but joining Athena queries with RDS data is complex and slower. It requires either:
-
Moving data from RDS to S3 frequently, or
-
Using federated queries, which are less efficient for multi-terabyte datasets.
Redshift + Redshift Spectrum simplifies this workflow by:
-
Keeping frequent queries on high-performance Redshift storage
-
Querying historical data directly from S3 without full data loads
-
Reducing operational overhead while maintaining high performance and reproducibility
Although both approaches involve moving historical data to S3, the critical difference is how recent (hot) data is handled. The Redshift approach:
-
Provides fast, high-performance queries on large, frequently accessed datasets
-
Separates hot and cold data, optimizing storage costs
-
Supports scalable analytics across terabytes of data, including joins between recent and historical data
While using an RDS read replica may appear more straightforward and leverage existing OLTP infrastructure, it is slower, less scalable, and more expensive for analytics workloads at this scale. The Redshift + Redshift Spectrum solution balances performance, cost, and scalability, making it the optimal choice.
I hope this clears up your confusion. Please don’t hesitate to ask if you have any additional thoughts or need clarification.
Regards,
Nikee @ Tutorials Dojo
-