Home › Forums › AWS › AWS Certified Data Analytics – Specialty › Redshift DISTYLE KEY
-
Redshift DISTYLE KEY
-
Hi, I’d like to challenge the below question. The suggested answer is “Designate a DISTSTYLE KEY (destination) distribution for the Trips table and sort by delivery time” . However imagine you have imbalance of destinations (which is almost guaranteed – i.e. a city airport will be the most popular). All airport data will be stored on the same node and it’ll become a bottleneck. Whereas is you do even, each node will get more or less even load.
Here is AWS video that I refer https://www.youtube.com/watch?v=TFLoCLXulU0 (watch 13.30 onwards)
====
A food delivery service startup has thousands of riders that serve hundreds of thousands of customers every day. The number of users is expected to increase due to the effect of the pandemic. As a response, the company’s Data Analyst has decided to move the existing data to Amazon Redshift with the following schema:
A trips fact table that contains details about completed deliveries.
A riders dimension table for rider profiles.
A customer fact table for customer profiles.
The Data Analyst wants to evaluate profitability by analyzing the delivery date and time as well as the destination of each trip. The rider’s data almost don’t change while the customer’s data changes frequently.
How should the Data Analyst design the table to achieve optimal query performance?
- This discussion was modified 3 years ago by Klimok.
-
Hello Klimok,
Thanks for your feedback.
“All airport data will be stored on the same node and it’ll become a bottleneck”
>> DISTSTYLE KEY distributes data based on the DISTKEY values. Saying that all data will be stored on the same node means that you have data coming from exactly 1 person/entity, which is not usually the case for real-world data. Skewness in distribution can’t be totally eliminated so you should pick a DISTKEY value that would cause the least amount of skewness.
Queries with a DISTKEY/SORTKEY give better query performance than those without them. And you can’t do that with DISTYLE EVEN. DISTYLE EVEN does not care for column values. Imagine if you need to query data from a user. There’s a high chance that the user’s data is spread across different nodes, which may cause a network bottleneck because data must be aggregated from multiple compute nodes.
Let me know what you think.
Regards,
Carlo @ Tutorials Dojo
- This reply was modified 3 years ago by Carlo-TutorialsDojo.
-
Ok, thanks for explanations. If I understand you correctly, your point is that skewed KEY performance is still better than totally randomized EVEN.
The forum ‘AWS Certified Data Analytics – Specialty’ is closed to new discussions and replies.