Redshift DISTYLE KEY

Redshift DISTYLE KEY

Klimok updated 2 years, 8 months ago 2 Members · 3 Posts
AWS Certified Data Analytics – Specialty
Klimok

Member
August 12, 2021 at 10:19 am
Hi, I’d like to challenge the below question. The suggested answer is “Designate a DISTSTYLE KEY (destination) distribution for the Trips table and sort by delivery time” . However imagine you have imbalance of destinations (which is almost guaranteed – i.e. a city airport will be the most popular). All airport data will be stored on the same node and it’ll become a bottleneck. Whereas is you do even, each node will get more or less even load.

Here is AWS video that I refer https://www.youtube.com/watch?v=TFLoCLXulU0 (watch 13.30 onwards)

====

A food delivery service startup has thousands of riders that serve hundreds of thousands of customers every day. The number of users is expected to increase due to the effect of the pandemic. As a response, the company’s Data Analyst has decided to move the existing data to Amazon Redshift with the following schema:

A trips fact table that contains details about completed deliveries.

A riders dimension table for rider profiles.

A customer fact table for customer profiles.

The Data Analyst wants to evaluate profitability by analyzing the delivery date and time as well as the destination of each trip. The rider’s data almost don’t change while the customer’s data changes frequently.

How should the Data Analyst design the table to achieve optimal query performance?
- This discussion was modified 2 years, 8 months ago by Klimok.
Carlo-TutorialsDojo

Member
August 13, 2021 at 3:05 am
Hello Klimok,

Thanks for your feedback.

“All airport data will be stored on the same node and it’ll become a bottleneck”

>> DISTSTYLE KEY distributes data based on the DISTKEY values. Saying that all data will be stored on the same node means that you have data coming from exactly 1 person/entity, which is not usually the case for real-world data. Skewness in distribution can’t be totally eliminated so you should pick a DISTKEY value that would cause the least amount of skewness.

Queries with a DISTKEY/SORTKEY give better query performance than those without them. And you can’t do that with DISTYLE EVEN. DISTYLE EVEN does not care for column values. Imagine if you need to query data from a user. There’s a high chance that the user’s data is spread across different nodes, which may cause a network bottleneck because data must be aggregated from multiple compute nodes.

Let me know what you think.

Regards,

Carlo @ Tutorials Dojo
- This reply was modified 2 years, 8 months ago by Carlo-TutorialsDojo.
Klimok

Member
August 13, 2021 at 8:27 pm

Ok, thanks for explanations. If I understand you correctly, your point is that skewed KEY performance is still better than totally randomized EVEN.

Viewing 1 - 3 of 3 replies

The forum ‘AWS Certified Data Analytics – Specialty’ is closed to new discussions and replies.

Redshift DISTYLE KEY

Klimok

Carlo-TutorialsDojo

Klimok