
  • Athena partitioning question may be wrong

  • Timothy Teoh

    Member
    July 19, 2021 at 11:52 pm

    Great content for this exam, wish there were more unique questions.

    I have doubts about the answer to this:

    A smart home automation firm performs near real-time data analysis of data collected through an Amazon Kinesis Data Firehose delivery stream. The data is generated from 100 unique devices and then stored in an Amazon S3 bucket in JSON format that works as a data lake. Every night at 12:00 AM, the data is loaded for processing.

    The suggested answer says:

    “In Kinesis Firehose, the default prefix is already based on year, month, day and hour”.

    However, the AWS docs state that the default KDF prefix format is not directly compatible with Hive-style partitioning:

    https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-firehose-custom-prefixes-for-amazon-s3-objects/

    Additionally, the question makes no mention of any need to query by device, as it only states “changes over time”. The solution explanation also says “A common practice is to partition the data based on time, often leading to a multi-level partitioning scheme. For example, a customer who has data coming in every hour might decide to partition by year, month, date, and hour.”

    While data is LOADED into S3 daily, that does not mean the source data also needs to fall into daily buckets. Since the question states “near real time analysis”, it is reasonable to think that an hourly grouping would be relevant.

    Therefore both these other answers:

    – “Configure the new delivery stream to use a CUSTOM prefix based on year, month, day, and hour”, and

    – “In Athena, create the external table and partition it by year, month, day, and hour”

    should be acceptable as well from the information given in the question.
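
    For reference, a Hive-compatible custom prefix (the kind the linked blog post describes) uses Firehose’s `!{timestamp:...}` expressions. A rough sketch of the delivery stream’s S3 destination settings, with a made-up bucket layout:

    ```
    Prefix:            data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
    ErrorOutputPrefix: errors/!{firehose:error-output-type}/
    ```

    With a prefix like this, the resulting S3 keys follow the `key=value` naming that Athena can register directly as partitions, unlike the default `YYYY/MM/dd/HH` prefix.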

    • This discussion was modified 2 years, 7 months ago by  Timothy Teoh.
  • Carlo-TutorialsDojo

    Administrator
    July 20, 2021 at 6:15 am

    Hello Timothy,


    Thanks for your feedback.

    You have a valid point. I think this makes the question quite confusing:

    “near real-time data analysis”

    >>I think this should be near-real-time streaming or better yet, processed in batch since we want to study data changes over time. Hence, no need for near-real-time streaming.

    Since there are unique devices with different readings, it would make more sense to partition the data by the device id and the date. Say I want to graph the readings over time from device 1: if we partitioned only by date, we would also be getting readings from other devices that we don’t need. If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only from that partition.
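
    As a sketch of what that would look like in Athena (table, column, and bucket names here are all made up for illustration):

    ```sql
    -- Hypothetical external table over the Firehose output in S3
    CREATE EXTERNAL TABLE device_readings (
      reading_time timestamp,
      temperature  double
    )
    PARTITIONED BY (device_id string, dt string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-data-lake/readings/';
    -- (partitions must still be registered, e.g. via ALTER TABLE ... ADD PARTITION)

    -- Athena prunes to device-1's partitions only:
    SELECT reading_time, temperature
    FROM device_readings
    WHERE device_id = 'device-1'
      AND dt BETWEEN '2021-07-01' AND '2021-07-19';
    ```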


    Let me know what you think,


    Regards,

    Carlo @ Tutorials Dojo

    • Timothy Teoh

      Member
      July 20, 2021 at 7:01 pm

      Possibly; some more specific wording around the use case would be helpful! I think the thing is that, having gone through the Firehose FAQs, the point about needing custom prefixes for partitioning stood out, so that appeared to be what was being tested.

      Just sat the exam and there was also a question about this with YMD partitioning vs device-YMD partitioning, but it called out that the question was on which was more cost effective, so I did choose device-YMD partitioning there!

  • Klimok

    Member
    August 13, 2021 at 10:22 pm

    I think this question still needs tweaking.

    Partitioning by device assumes you’re interested in a subset of devices rather than specific time horizons. However, the question mentions “study data changes over time that are stored in those files”, so partitioning by time looks more suitable.

    ==

    Also, I’d like to confirm why you refer to storage cost reduction when discussing partitioning – is the idea that grouping a device’s data into a single daily file provides better compression vs saving it into 24 hourly buckets?

    • This reply was modified 2 years, 6 months ago by  Klimok.
    • Carlo-TutorialsDojo

      Administrator
      August 14, 2021 at 12:06 am

      Hello Klimok,

      Thanks for your feedback.

      Assuming that we need to graph data changes over time captured from all devices (not just a subset): if you partition the data by date only, then a query for a single device still has to scan every partition in the date range and filter out the other devices’ data points. Compare that with partitioning by device id, where you only need to query one device’s partitions to retrieve all the data points transmitted by that device. Less data is scanned, so the latter is more cost-effective.
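
      To make the comparison concrete (table and column names are illustrative only), here is how the same per-device query behaves under the two schemes:

      ```sql
      -- Date-only partitioning: scans all devices' data for the date range
      SELECT dt, avg(temperature)
      FROM readings_by_date
      WHERE dt BETWEEN '2021-07-01' AND '2021-07-31'
        AND device_id = 'device-1'   -- filtered after the scan, not pruned
      GROUP BY dt;

      -- Device + date partitioning: Athena prunes to device-1's partitions
      SELECT dt, avg(temperature)
      FROM readings_by_device
      WHERE device_id = 'device-1'
        AND dt BETWEEN '2021-07-01' AND '2021-07-31'
      GROUP BY dt;
      ```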

      Regards,

      Carlo @ Tutorials Dojo
