Guided Lab: Performing SQL Queries on DynamoDB with Amazon Athena

Description

Amazon DynamoDB is a fully managed NoSQL database designed for fast and predictable performance. It is ideal for high-scale applications that require consistent, low-latency data access. While DynamoDB is excellent for operational data storage and quick access patterns, it does not support rich SQL-style analytics such as joins and aggregations, which limits its use for complex analytical workloads.

On the other hand, Amazon Athena is a serverless, interactive query service that allows you to analyze data in Amazon S3 using standard SQL. Athena is highly cost-effective for ad-hoc querying and is optimized for analytical tasks, making it ideal for use cases requiring SQL compatibility.

Using Athena to query DynamoDB data in S3 provides several benefits:

  • SQL Compatibility: Athena allows you to use SQL for ad-hoc analysis, which is more flexible than DynamoDB’s NoSQL model. Complex queries like joins, aggregations, and filtering are more straightforward with SQL, making Athena a powerful tool for analytics.
  • Cost-Efficient Analysis: DynamoDB’s read-and-write capacity model can be costly for large-scale or frequent analytical queries. With S3’s low-cost storage and Athena’s per-query pricing, you can save costs, especially for big data analysis.
  • Data Lake Integration: By exporting data to S3, you can integrate DynamoDB data with other datasets in a data lake. This allows you to combine multiple data sources for deeper insights and make the data accessible to other AWS services and tools.
  • Reduced Load on DynamoDB: Analytical queries in Athena avoid straining your DynamoDB tables, improving your application’s overall performance. This separation of transactional and analytical workloads also simplifies data management.

Consider an e-commerce platform that stores order transactions in DynamoDB. The business team wants to analyze sales trends and generate monthly reports. With Athena, they can query exported DynamoDB data in S3 using SQL, create insights without impacting the main database, and integrate it with other sources for a more comprehensive analysis.

In this lab, you’ll learn to transfer data from DynamoDB to Amazon S3 and query it with Athena using an AWS Glue Python Shell job. By offloading the data to S3 and using Athena, you can keep DynamoDB focused on real-time application workloads while leveraging Athena’s SQL capabilities for analytics. This approach lets you maintain optimal operational and analytical workload performance, enhancing cost efficiency and flexibility.

Prerequisites

This lab assumes you have a basic understanding of AWS Lambda, DynamoDB, S3, AWS Glue, and Athena.

If you find any gaps in your knowledge, consider taking the following lab:

Objectives

By the end of this lab, you will:

  • Set up a DynamoDB table and generate sample data using AWS Lambda.
  • Export data from DynamoDB to Amazon S3 using a Python Shell script in an AWS Glue job.
  • Configure AWS Glue to transform and prepare data in S3 for querying.
  • Use Amazon Athena to query the data stored in S3.

Lab Steps

Create a DynamoDB Table

1. Navigate to DynamoDB.

2. Create a table using the following configuration:

  • Table Name: Enter a unique name for your table, e.g., CustomerOrders.
  • Primary Key: Set up a primary key for the table, such as:
    • Partition Key: id (String)
  • Leave other settings as default for now, unless you have specific throughput or encryption needs.

3. Click Create Table to finalize the process. Then, wait until the table status changes to “Active.”

4. Once the table is created, note its name, as you’ll need it for the Lambda function.
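
If you prefer to create the table from code rather than the console, the following is a minimal boto3 sketch of the same configuration (the table name CustomerOrders is the example used above; on-demand capacity is assumed, so no throughput settings are needed):

import boto3

dynamodb = boto3.client('dynamodb')

# Create the table with a single 'id' partition key (String)
dynamodb.create_table(
    TableName='CustomerOrders',
    AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST'
)

# Wait until the table status changes to Active
dynamodb.get_waiter('table_exists').wait(TableName='CustomerOrders')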

Create an AWS Lambda Function for Data Generation

1. Go to Lambda in the AWS Console and click Create Function.

2. Function Setup:

  • Choose Author from scratch.
  • Function name: Name it something descriptive, like DynamoDBDataGenerator.
  • Runtime: Select Python 3.x (e.g., Python 3.12).
  • Execution Role: Choose Use an existing role, and select PlayCloud-Sandbox.

3. Click Create Function.

4. Replace the default code with the following Python script in the Function code section. This code uses the boto3 library to interact with DynamoDB and generate sample data.

import boto3
import random
from datetime import datetime, timedelta

# Initialize DynamoDB
dynamodb = boto3.resource('dynamodb')
table_name = '<your_table_name_for_DynamoDB_Data>'  # Replace with your table name
table = dynamodb.Table(table_name)

# Sample data lists for generating realistic names, cities, and emails
first_names = ["John", "Jane", "Alice", "Bob", "Charlie", "Diana"]
last_names = ["Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia"]
cities = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix", "Philadelphia"]

def generate_random_email(first_name, last_name):
    domains = ["example.com", "email.com", "testmail.com", "fakemail.com"]
    domain = random.choice(domains)
    return f"{first_name.lower()}.{last_name.lower()}@{domain}"

def generate_random_date():
    start_date = datetime.now() - timedelta(days=365)  # up to one year ago
    random_days = random.randint(0, 365)
    return (start_date + timedelta(days=random_days)).isoformat()

def lambda_handler(event, context):
    # Specify the number of records to generate
    num_records = 100  # Adjust this as needed

    for _ in range(num_records):
        first_name = random.choice(first_names)
        last_name = random.choice(last_names)
        city = random.choice(cities)

        item = {
            'id': str(random.randint(1, 1000000)),
            'name': f"{first_name} {last_name}",
            'email': generate_random_email(first_name, last_name),
            'city': city,
            'created_at': generate_random_date()
        }

        # Insert the item into DynamoDB
        table.put_item(Item=item)

    return {
        'statusCode': 200,
        'body': f'{num_records} records successfully inserted into {table_name}.'
    }

Take your time to review the code:

  • Name Generation: Uses predefined lists of first and last names to construct realistic names.
  • Email Generation: The helper function generate_random_email builds email addresses by combining the first and last names with a random domain.
  • Date Generation: Uses datetime and timedelta to create random dates within the last year.

a. Update <your_table_name_for_DynamoDB_Data> in the code to match your DynamoDB table name.

b. Click Deploy to save the function.

c. Go to Configuration > General Configuration > Edit > Change the Timeout to 1 min.

5. Return to the code editor and click Test in the Lambda console to create a test event:

      • Event Name: Enter a name like Test.
      • Use the default settings for the event template and click Save.

6. Click Test again to run the function. It should insert 100 sample records into the DynamoDB table.

7. Go back to DynamoDB Console, select your table, and click on Explore table items to verify that the sample data has been added.
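
If you would rather verify the data programmatically, here is a minimal boto3 sketch that scans a handful of items (assuming the table is named CustomerOrders):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('CustomerOrders')

# Scan only the first page of up to 5 items to confirm the Lambda inserted data
response = table.scan(Limit=5)
for item in response['Items']:
    print(item['id'], item['name'], item['city'])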

Set Up Two S3 Buckets

1. Navigate to the S3 service.

2. Create two S3 buckets (a code-based alternative is sketched after this list):

a. One for the DynamoDB data.

      • Bucket Name: Enter a unique name for your bucket (e.g., dynamodb-data-storage-<any-number-to-make-it-unique>).
      • Leave all other settings as default and click Create Bucket.

b. One for Athena query results.

      • Bucket Name: Enter a unique name for your bucket (e.g., athena-results-bucket-<any_number_to_make_it_unique>).
      • Leave all other settings as default and click Create Bucket.
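
As referenced above, here is a minimal boto3 sketch that creates both buckets from code; the bucket names are placeholders and must be globally unique, so replace the numeric suffix with your own:

import boto3

s3 = boto3.client('s3')

for bucket in ['dynamodb-data-storage-12345', 'athena-results-bucket-12345']:
    # Note: outside us-east-1 you must also pass
    # CreateBucketConfiguration={'LocationConstraint': '<your-region>'}
    s3.create_bucket(Bucket=bucket)
    print(f"Created bucket: {bucket}")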

Create an AWS Glue Job for Data Export

1. In the AWS Console, go to AWS Glue.

2. Navigate to ETL Jobs > Author and edit ETL jobs > Script Editor. Under Engine, select Python shell, choose Start fresh under Options, and click Create script.

3. Configure your ETL job using the following settings:

  • Name: Enter a name, like DynamoDBToS3Export.
  • IAM Role: Choose an existing role, PlayCloud-Sandbox.
  • Ensure the Data processing units are set to 1/16 DPU.
  • Click Save.

4. In the Script Editor: Enter the following code to export data from DynamoDB to S3.

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import BytesIO

# Initialize boto3 resources and clients
dynamodb = boto3.resource('dynamodb')
s3 = boto3.client('s3')

# Define DynamoDB and S3 details
table_name = '<your-DynamoDB-table-name>'  # Replace with your DynamoDB table name
output_bucket = '<your-S3-bucket-name>'  # Replace with your S3 bucket name
output_key = 'parquet-output/'  

# Initialize DynamoDB Table
table = dynamodb.Table(table_name)

# Step 1: Scan DynamoDB Table and Retrieve Data
def scan_dynamodb_table(table):
    response = table.scan()
    data = response['Items']
    
    # Handle pagination
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])
    
    return data

# Retrieve data from DynamoDB
data = scan_dynamodb_table(table)

# Step 2: Convert Data to Pandas DataFrame
df = pd.DataFrame(data)

# Step 3: Convert DataFrame to Apache Parquet Format
arrow_table = pa.Table.from_pandas(df)  # avoid shadowing the DynamoDB 'table' variable
parquet_buffer = BytesIO()
pq.write_table(arrow_table, parquet_buffer)

# Step 4: Upload Parquet File to S3
parquet_buffer.seek(0)
s3.put_object(Bucket=output_bucket, Key=f"{output_key}dynamodb_data.parquet", Body=parquet_buffer)

print("Export from DynamoDB to Parquet format in S3 completed.")

Take your time to review the code, which exports data from a DynamoDB table to an S3 bucket in Parquet format using the following steps:

  1. Import Libraries:
    • Imports boto3 for AWS interactions, pandas for data manipulation, and pyarrow for Parquet file handling.
  2. Initialize AWS Clients:
    • Sets up a dynamodb resource to access the DynamoDB table and an s3 client to interact with S3.
  3. Define AWS Resources:
    • Specifies the DynamoDB table name and S3 bucket name (placeholders <your-DynamoDB-table-name> and <your-S3-bucket-name> should be replaced with actual values).
  4. Scan DynamoDB Table:
    • The scan_dynamodb_table function scans the table and retrieves all items.
    • If the scan results are paginated, the code loops through all pages until all data is fetched.
  5. Convert Data to DataFrame:
    • The data is loaded into a pandas DataFrame for easier manipulation and compatibility with PyArrow.
  6. Convert DataFrame to Parquet Format:
    • Converts the DataFrame to a PyArrow Table and writes it to an in-memory buffer as a Parquet file.
  7. Upload Parquet to S3:
    • Uploads the Parquet file from the buffer to the specified S3 bucket and key location.
  8. Completion Message:
    • Prints a message indicating that the export is complete.

a. Update <your-DynamoDB-table-name> and <your-S3-bucket-name> in the code to match your resources.

b. Click Save, then run the Glue job manually to export the data to S3. Monitor the logs in the Runs tab to confirm the export was successful.
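
If you want to trigger or check the job outside the console, the following is a minimal boto3 sketch (assuming the job is named DynamoDBToS3Export):

import boto3

glue = boto3.client('glue')

# Start the job and capture the run ID
run = glue.start_job_run(JobName='DynamoDBToS3Export')
run_id = run['JobRunId']

# Check the run state (e.g., RUNNING, SUCCEEDED, FAILED)
status = glue.get_job_run(JobName='DynamoDBToS3Export', RunId=run_id)
print(status['JobRun']['JobRunState'])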

Create the Athena Table

1. Navigate to the Amazon Athena dashboard.

2. Before executing any queries, you need to specify an S3 bucket location where query results will be saved.

a. Click Edit settings.

b. Browse and select the athena-results bucket you created earlier, then click Save.

3. Return to the Query Editor, and run the following SQL query in the Athena query editor to create a new database:

CREATE DATABASE customer_order_db;

4. Define the table schema with a CREATE EXTERNAL TABLE statement pointing to your S3 data location:

CREATE EXTERNAL TABLE IF NOT EXISTS customer_order_db.dynamodb_data (
    id STRING,
    name STRING,
    email STRING,
    city STRING,
    created_at STRING
)
STORED AS PARQUET
LOCATION 's3://<your-S3-bucket-name>/parquet-output/';

This query creates an Athena table pointing to external data in S3. Instead of storing data within Athena, this configuration lets Athena query the data in S3 as if it were a database table. This approach is often used for analytical processing because Athena can directly access large datasets stored in S3, facilitating SQL-based analysis on data that may have originated from DynamoDB or other sources.

After creating this external table, you can write SQL queries against customer_order_db.dynamodb_data in Athena to access and analyze the data stored in the specified S3 location.


5. Run the query to create the table.

Query the Data in Athena

Once the table is created, you can use the following example queries to work with the data.

1.  Select All Records

SELECT * 
FROM customer_order_db.dynamodb_data
LIMIT 10;

This is a quick way to view a subset of records and confirm that the data was imported correctly. The LIMIT clause restricts the output to 10 rows.


2. Filter Records by City

SELECT 
  id, 
  name, 
  email, 
  city, 
  created_at 
FROM customer_order_db.dynamodb_data
WHERE city = 'New York';

This query filters the results to show only customers from New York, which can be useful for regional analysis.


3. Count Customers by City

SELECT 
  city, 
  COUNT(*) AS total_customers
FROM customer_order_db.dynamodb_data
GROUP BY city
ORDER BY total_customers DESC;

This query gives an aggregated view of the data, showing the total number of customers per city. The results are ordered in descending order, so the cities with the most customers appear first.
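
These queries can also be run programmatically through the Athena API. Below is a minimal boto3 sketch that submits the city-count query and prints the results; the output location points to the athena-results bucket created earlier (the bucket name here is a placeholder):

import time
import boto3

athena = boto3.client('athena')

query = """
SELECT city, COUNT(*) AS total_customers
FROM customer_order_db.dynamodb_data
GROUP BY city
ORDER BY total_customers DESC
"""

execution = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://athena-results-bucket-12345/'}
)
query_id = execution['QueryExecutionId']

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Print each result row (the first row returned is the column header)
if state == 'SUCCEEDED':
    for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows'][1:]:
        print([col.get('VarCharValue') for col in row['Data']])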


Congratulations on completing this lab! You’ve learned to export data from DynamoDB to S3 and query it using Athena. This setup allows you to maintain DynamoDB’s performance for application needs while gaining SQL functionality and cost-efficiency for analytics through Athena. Using S3 as a storage layer, you can integrate data from multiple sources, enabling comprehensive data analysis.

As a final step, remember it’s always a best practice to clean up resources after completing a lab. Deleting unused DynamoDB tables, S3 buckets, Glue jobs, and other resources will help maintain a clutter-free AWS environment.
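
If you prefer to clean up from code, here is a minimal boto3 sketch; the resource names below are the examples used throughout this lab, so replace them with the names you actually created:

import boto3

dynamodb = boto3.client('dynamodb')
lambda_client = boto3.client('lambda')
glue = boto3.client('glue')
s3 = boto3.resource('s3')

# Delete the DynamoDB table, Lambda function, and Glue job
dynamodb.delete_table(TableName='CustomerOrders')
lambda_client.delete_function(FunctionName='DynamoDBDataGenerator')
glue.delete_job(JobName='DynamoDBToS3Export')

# Empty and delete both buckets (a bucket must be empty before it can be deleted)
for bucket_name in ['dynamodb-data-storage-12345', 'athena-results-bucket-12345']:
    bucket = s3.Bucket(bucket_name)
    bucket.objects.all().delete()
    bucket.delete()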

Happy learning!!!
