Guided Lab: Creating Amazon Athena Partitioned Tables
Description
Amazon Athena allows for querying data stored in Amazon S3 using SQL, making it ideal for large datasets. However, scanning non-partitioned datasets can result in unnecessary costs and slower query times. Partitioning tables in Athena helps optimize queries by scanning only the relevant data partitions, leading to more efficient processing and cost savings.
Use Case Example: Consider a retail chain that stores daily transaction records from multiple locations in CSV format on Amazon S3. A data engineer uses Athena to analyze this data, but since only a subset of the data is queried, Athena scans the entire dataset, leading to inefficiencies and high costs. By partitioning the data by date (year, month) in a structured format, Athena scans only the relevant partitions, improving performance and reducing cost.
This lab will walk you through the process of creating partitioned tables in Amazon Athena, generating and uploading sample CSV data into Amazon S3 using an AWS Lambda function, and querying the partitioned data.
Prerequisites
This lab assumes you have a basic understanding of Amazon S3, AWS Lambda, and Amazon Athena.
If you find any gaps in your knowledge, consider taking the following labs:
- Creating an Amazon S3 bucket
- Creating an AWS Lambda function
- Amazon Athena Data Querying and Table Creation
Objectives
In this lab, you will:
- Learn how to create partitioned tables in Amazon Athena.
- Simulate data generation using an AWS Lambda function that writes CSV files into a structured S3 bucket.
- Query the partitioned table to observe the impact on performance and cost.
Lab Steps
Create an S3 Bucket to Store Data
1. Navigate to the S3 service.
2. Create two S3 buckets.
a. One for retail transaction data.
- Choose a unique bucket name (e.g., retail-transaction-data-lab-<any_number_to_make_it_unique>).
- Leave the remaining options as default.
- Click Create bucket.
b. One for Athena query results.
- Choose a unique bucket name (e.g., athena-results-bucket-<any_number_to_make_it_unique>).
- Leave the remaining options as default.
- Click Create bucket.
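If you prefer to script this step instead of using the console, a minimal boto3 sketch like the one below can create both buckets. The bucket names are the placeholders from this lab, and it assumes your credentials and default region are already configured:

import boto3

# Placeholder names -- replace <unique_number> with your own suffix
buckets = ['retail-transaction-data-lab-<unique_number>',
           'athena-results-bucket-<unique_number>']

s3 = boto3.client('s3')

for name in buckets:
    # Note: outside us-east-1, create_bucket also needs
    # CreateBucketConfiguration={'LocationConstraint': '<your-region>'}
    s3.create_bucket(Bucket=name)
    print(f'Created bucket: {name}')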
Set Up a Lambda Function to Generate and Upload CSV Files
1. Navigate to the AWS Lambda Console.
2. Create a new Lambda function using the following configurations:
- Choose Author from scratch.
- Function name: RetailDataGenerator
- Select Python 3.12 as the runtime.
- Execution role:
- Select Use an Existing Role: PlayCloud-Sandbox
- Click Create function
3. Set Up the Lambda Code:
a. Use the following Python code to generate and upload sample CSV data:
import boto3
import csv
from io import StringIO
from datetime import datetime
from dateutil.relativedelta import relativedelta
import random

s3_client = boto3.client('s3')
bucket_name = 'retail-transaction-data-lab-<unique-number>'

# Sample product names for generating data
product_names = [
    "Widget A", "Widget B", "Gadget C", "Gadget D",
    "Thingamajig E", "Contraption F", "Device G", "Apparatus H"
]

def generate_sample_data(transaction_date, num_records):
    """Generate sample transaction data for a given date."""
    data = []
    for _ in range(num_records):
        transaction_id = f'TRAN{random.randint(1000, 9999)}'
        store_id = f'ST{str(random.randint(1, 10)).zfill(3)}'
        product_id = f'P{str(random.randint(1, 50)).zfill(3)}'
        product_name = random.choice(product_names)
        quantity = random.randint(1, 10)
        price = f'{random.uniform(5.0, 100.0):.2f}'
        # Append the generated values as a row
        data.append([transaction_id, store_id, product_id, product_name, quantity, price, transaction_date])
    return data

def create_partition_path(year, month):
    """Return the S3 path for the given year and month."""
    return f'year={year}/month={month}/'

def upload_csv_to_s3(data, year, month, write_header=True):
    """Create and upload the CSV to S3 for a given partition."""
    # Create the CSV in memory
    csv_buffer = StringIO()
    writer = csv.writer(csv_buffer)
    # Write the header only if specified
    if write_header:
        writer.writerow(['transaction_id', 'store_id', 'product_id', 'product_name', 'quantity', 'price', 'transaction_date'])
    # Write the data rows
    writer.writerows(data)
    # Generate the S3 partition path
    partition_path = create_partition_path(year, str(month).zfill(2))
    # Upload the CSV to S3
    s3_client.put_object(Bucket=bucket_name, Key=partition_path + 'transactions.csv', Body=csv_buffer.getvalue())
    print(f'Uploaded CSV to s3://{bucket_name}/{partition_path}transactions.csv')

def lambda_handler(event, context):
    # Optional: Seed the random number generator for reproducibility
    random.seed(42)
    # Set the number of months and records to generate
    months_to_generate = 12  # Last 12 months
    records_per_month = random.randint(50, 150)  # Generate between 50 and 150 records
    current_date = datetime.now()
    for month_offset in range(months_to_generate):
        # Calculate the date and year/month once for this month
        month_date = current_date - relativedelta(months=month_offset)
        transaction_date = month_date.strftime('%Y-%m-%d')
        year = month_date.year
        month = month_date.month
        # Generate dynamic transaction data
        data = generate_sample_data(transaction_date, records_per_month)
        # Upload the generated CSV data to S3
        upload_csv_to_s3(data, year, month)
    return {
        'statusCode': 200,
        'body': 'Uploaded transaction data for multiple months.'
    }
Note: Replace <unique-number> in bucket_name with your unique bucket number.
Take your time to review the code:
This Python code generates sample transaction data and uploads it to an Amazon S3 bucket in CSV format. The uploaded data is organized into partitions based on the year and month, making it suitable for tools like Amazon Athena that can query partitioned data.
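For reference, each uploaded transactions.csv will look roughly like the sample below; the values are randomly generated, so yours will differ:

transaction_id,store_id,product_id,product_name,quantity,price,transaction_date
TRAN4829,ST003,P027,Gadget C,4,37.52,2024-01-15
TRAN1305,ST007,P012,Widget A,2,81.90,2024-01-15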
Test the Lambda Function
1. Once your function is deployed, click the arrow dropdown next to the Test button.
2. Click on Configure test event, and follow the configuration below:
- Event name: Test
- Template - optional: hello-world
- Leave the rest as default
- Click on Save
3. Now, click on Test to execute the function.
- Take your time to review the result.
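As an alternative to the console's Test button, you could also invoke the function programmatically; a short boto3 sketch, assuming your sandbox credentials are allowed to invoke Lambda:

import boto3

lambda_client = boto3.client('lambda')

# Synchronously invoke the generator and print its returned payload
response = lambda_client.invoke(
    FunctionName='RetailDataGenerator',
    InvocationType='RequestResponse'
)
print(response['Payload'].read().decode('utf-8'))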
4. Check your S3 bucket; the files should be uploaded there.
The file structure should look like the following example:
s3://retail-transaction-data-lab-3333/
├── year=2023/
│ ├── month=01/
│ │ └── transactions.csv
├── year=2024/
│ ├── month=01/
│ │ └── transactions.csv
│ └── ...
- s3://retail-transaction-data-lab-<unique-number>/: The bucket root containing all transaction data.
- year=YYYY/: Subfolders representing each year.
- month=MM/: Subfolders representing each month within a year.
- transactions.csv: The CSV file for each month's transaction data.
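To double-check this layout programmatically, a boto3 sketch like the following lists the year=/month= prefixes (the bucket name is the placeholder from earlier):

import boto3

bucket_name = 'retail-transaction-data-lab-<unique-number>'  # your data bucket
s3 = boto3.client('s3')

# Walk the top-level year= prefixes, then the month= prefixes under each
years = s3.list_objects_v2(Bucket=bucket_name, Delimiter='/')
for year in years.get('CommonPrefixes', []):
    months = s3.list_objects_v2(Bucket=bucket_name,
                                Prefix=year['Prefix'], Delimiter='/')
    for month in months.get('CommonPrefixes', []):
        print(month['Prefix'])  # e.g. year=2024/month=01/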
Create a Partitioned Table in Athena
1. Navigate to the Amazon Athena console.
2. Before executing any queries, you need to specify an S3 bucket location where query results will be saved.
a. Click on Edit settings.
b. Browse to and select the Athena results bucket you created earlier (e.g., athena-results-bucket-<any_number_to_make_it_unique>), then click Save.
3. Run the following SQL query in the Athena query editor to create a new database:
CREATE DATABASE retail_data_db;
4. Run the following SQL query to create a table that reflects the partitioned structure. Note that it includes the product_name column written by the Lambda function and tells Athena to skip each file's header row:
CREATE EXTERNAL TABLE IF NOT EXISTS retail_data_db.retail_transactions (
    transaction_id STRING,
    store_id STRING,
    product_id STRING,
    product_name STRING,
    quantity BIGINT,
    price DOUBLE,
    transaction_date STRING
)
PARTITIONED BY (year STRING, month STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://retail-transaction-data-lab-<unique-number>/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Note: Replace <unique-number> in LOCATION with your unique bucket number, as you did in the Lambda code.
- CREATE EXTERNAL TABLE: Creates an external table in Athena that references data stored in S3.
- PARTITIONED BY (year STRING, month STRING): Specifies that the table is partitioned by year and month.
- LOCATION: Points to the S3 location where the data is stored.
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',': Specifies that the data is in CSV format, with fields separated by commas.
- 'skip.header.line.count'='1': Tells Athena to skip the header row at the top of each CSV file so it is not read as data.
5. Add Partitions:
Once the table is created, add partitions to the table so that Athena knows how to read data from the partitioned structure. Use the following command:
MSCK REPAIR TABLE retail_data_db.retail_transactions;
Athena supports Apache Hive-style partitioning, where the folder structure in S3 includes key-value pairs separated by equal signs (for example, country=us/… or year=2021/month=01/day=26/…). These folder paths represent both the partition keys and their corresponding values. The MSCK REPAIR TABLE command loads new partitions into a partitioned table that follows this Hive-style format. It automatically detects and adds the partitions to the table's metadata, but it only works with Hive-style partition structures. For more information, see the MSCK REPAIR TABLE page in the Amazon Athena documentation.
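For reference, partitions can also be registered one at a time with explicit DDL, which is what you would do if your layout were not Hive-style. A sketch for the January 2024 partition, using the placeholder bucket name from earlier:

ALTER TABLE retail_data_db.retail_transactions ADD IF NOT EXISTS
PARTITION (year = '2024', month = '01')
LOCATION 's3://retail-transaction-data-lab-<unique-number>/year=2024/month=01/';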
6. Query the Partitioned Data: You can now query specific partitions of the table using SQL. For example:
a. Retrieve All Transactions from a Specific Year and Month:
SELECT *
FROM retail_data_db.retail_transactions
WHERE year = '2024' AND month = '01';
This query retrieves all transactions from January 2024 by targeting the specific partition (year = '2024' AND month = '01'). Since Athena scans only the matching partition, the query is more efficient than scanning the entire dataset. Change the year and month values to explore other partitions.
b. Count Total Transactions for Each Month Across Years:
SELECT year, month, COUNT(*) AS total_transactions
FROM retail_data_db.retail_transactions
GROUP BY year, month
ORDER BY year, month;
This query counts the total number of transactions for each month. Because there is no partition filter, it scans every partition, grouping the results by year and month to count transactions per partition and providing insight into transaction volumes over time.
c. Get Total Revenue for a Specific Month:
SELECT SUM(price * quantity) AS total_revenue
FROM retail_data_db.retail_transactions
WHERE year = '2024' AND month = '01';
This query calculates the total revenue for January 2024 by multiplying price and quantity for every record in the 2024/01 partition. Change the year and month values to explore other partitions.
d. Find the Top 5 Products Sold in a Specific Month:
SELECT product_name, SUM(quantity) AS total_quantity_sold
FROM retail_data_db.retail_transactions
WHERE year = '2024' AND month = '01'
GROUP BY product_name
ORDER BY total_quantity_sold DESC
LIMIT 5;
This query identifies the top 5 products sold in January 2024 by aggregating the quantity sold and sorting the results in descending order.
e. Filter Transactions by Date Range:
SELECT *
FROM retail_data_db.retail_transactions
WHERE (year = '2024' AND month >= '01' AND month <= '03');
This query retrieves all transactions from January to March 2024 by comparing the month partition key against the range '01' to '03'. Since the table is partitioned by year and month, only the relevant partitions are scanned. Change the year and month values to explore other ranges.
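To confirm the pruning yourself, compare the "Data scanned" statistic Athena reports for a partition-filtered query against an unfiltered one. You can also prepend EXPLAIN to a query to inspect its plan (supported on Athena engine version 2 and later; the exact output format varies):

EXPLAIN
SELECT *
FROM retail_data_db.retail_transactions
WHERE year = '2024' AND month = '01';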
Congratulations on completing the lab! You’ve successfully learned how to create partitioned tables in Athena, which improves query efficiency and reduces costs by scanning only the relevant data. As the retail example shows, this optimization is essential for handling large datasets efficiently, ensuring quicker queries and better cost management. To go further, explore techniques such as compression and columnar formats like Parquet.
As a final step, remember it’s always a best practice to clean up resources after completing a lab. Deleting unused resources will help maintain a clutter-free AWS environment.
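As a rough sketch (using the placeholder bucket names from this lab, which you should replace with your own), emptying and deleting both buckets with boto3 could look like this; the Athena table and database can likewise be removed with DROP TABLE and DROP DATABASE statements:

import boto3

s3 = boto3.resource('s3')

# Replace the placeholders with your actual bucket names before running
for name in ('retail-transaction-data-lab-<unique_number>',
             'athena-results-bucket-<unique_number>'):
    bucket = s3.Bucket(name)
    bucket.objects.all().delete()  # delete every object in the bucket
    bucket.delete()                # then delete the now-empty bucket
    print(f'Deleted bucket: {name}')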
Happy Learning!