Data & Analytics
Data lakes, streaming analytics, Glue, Athena, and data architecture patterns
AWS Data & Analytics Services
Master AWS data and analytics with free flashcards and hands-on practice. This lesson covers Amazon Redshift, AWS Glue, Amazon Athena, Amazon Kinesis, and AWS Lake Formation, the essential services for building scalable data pipelines and analytics solutions on AWS.
Welcome to AWS Data & Analytics
The AWS data and analytics ecosystem offers a comprehensive suite of services for ingesting, storing, processing, and analyzing data at any scale. Whether you're building a real-time streaming application, a massive data warehouse, or a serverless query engine, AWS provides purpose-built tools that integrate seamlessly.
In this lesson, you'll learn how to architect modern data solutions using AWS managed services, understand when to use each tool, and explore real-world implementation patterns that power today's data-driven applications.
Core Concepts
Amazon Redshift: Cloud Data Warehouse
Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for Online Analytical Processing (OLAP) workloads. Unlike transactional databases, Redshift is optimized for complex queries across massive datasets.
Key Architecture Components:
+-------------------------------------------+
|             REDSHIFT CLUSTER              |
+-------------------------------------------+
|                                           |
|  +-----------+       +-----------------+  |
|  |  Leader   |<----->| Compute Node 1  |  |
|  |  Node     |       +-----------------+  |
|  |           |       +-----------------+  |
|  |  (Query   |<----->| Compute Node 2  |  |
|  |  Planner) |       +-----------------+  |
|  |           |       +-----------------+  |
|  |           |<----->| Compute Node N  |  |
|  +-----------+       +-----------------+  |
|                                           |
+-------------------------------------------+
      ^                        |
      |                        v
  SQL Queries               Results
Columnar Storage: Redshift stores data in columns rather than rows, which dramatically improves query performance for analytical workloads:
| Row Storage (OLTP) | Column Storage (OLAP) |
|---|---|
| ID, Name, Age, City | ID: 1, 2, 3, 4... |
| 1, Alice, 30, NYC | Name: Alice, Bob, Carol... |
| 2, Bob, 25, LA | Age: 30, 25, 35... |
| 3, Carol, 35, SF | City: NYC, LA, SF... |
💡 Tip: When querying "SELECT AVG(Age) FROM users", columnar storage only reads the Age column, skipping all other data.
Distribution Styles: Redshift distributes data across compute nodes using four strategies:
- KEY: Distributes rows based on a specific column value (co-locates related data for joins)
- ALL: Replicates the entire table to every node (best for small dimension tables)
- EVEN: Round-robin distribution (useful for tables without an obvious join column)
- AUTO: Lets Redshift choose and adjust the style based on table size (the default)
CREATE TABLE sales (
sale_id INT,
product_id INT,
amount DECIMAL(10,2),
sale_date DATE
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (sale_date);
Redshift Spectrum: Query data directly in S3 without loading it into Redshift, which is perfect for infrequently accessed data or data lake integration.
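As a rough sketch of how Spectrum is typically wired up, you create an external schema that points at the Glue Data Catalog and then query S3-backed tables alongside local ones. Here the DDL and query are submitted through the boto3 Redshift Data API; the schema, IAM role, cluster, and table names are placeholder assumptions:
import boto3

redshift_data = boto3.client('redshift-data')

# Map an external schema onto a Glue Data Catalog database (placeholder names)
create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_sales
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
"""

# Query S3-resident data through Spectrum, joined with a local Redshift table
spectrum_query = """
SELECT s.product_id, SUM(s.amount) AS total_sales
FROM spectrum_sales.raw_sales s
JOIN products p ON p.product_id = s.product_id
GROUP BY s.product_id;
"""

for sql in (create_schema_sql, spectrum_query):
    redshift_data.execute_statement(
        ClusterIdentifier='my-redshift-cluster',  # placeholder cluster
        Database='analytics',
        DbUser='admin',
        Sql=sql
    )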
AWS Glue: Serverless ETL
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare data for analytics. It's serverless: you don't provision or manage infrastructure.
Key Components:
+------------------------------------------------+
|             AWS GLUE ARCHITECTURE              |
+------------------------------------------------+
|                                                |
|  Data Catalog (metadata store)                 |
|  +------------------------------------+        |
|  |  - Databases                       |        |
|  |  - Tables                          |        |
|  |  - Schemas                         |        |
|  +------------------------------------+        |
|       ^                     |                  |
|       |                     v                  |
|  Crawlers                ETL Jobs              |
|  +---------------+       +------------------+  |
|  | Scan S3       |       | PySpark / Python |  |
|  | Scan RDS      |       | Transform data   |  |
|  | Infer schema  |       | Write output     |  |
|  +---------------+       +------------------+  |
|       ^                     |                  |
|       |                     v                  |
|  Data Sources            Data Targets          |
|  (S3, RDS, etc.)         (S3, Redshift, etc.)  |
+------------------------------------------------+
Glue Data Catalog: A centralized metadata repository that integrates with:
- Amazon Athena
- Amazon Redshift Spectrum
- Amazon EMR
- AWS Lake Formation
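Because the catalog is just an API-accessible metadata store, these services (or your own code) can all look up the same table definitions. A minimal sketch using boto3, assuming the sales_db database created in the examples below:
import boto3

glue = boto3.client('glue')

# List the tables the catalog knows about in a database (assumed name)
tables = glue.get_tables(DatabaseName='sales_db')
for table in tables['TableList']:
    columns = table['StorageDescriptor']['Columns']
    print(table['Name'], [c['Name'] for c in columns])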
Glue Crawlers: Automatically scan data sources and populate the Data Catalog:
import boto3
glue = boto3.client('glue')
response = glue.create_crawler(
Name='s3-sales-crawler',
Role='AWSGlueServiceRole-CrawlerAccess',
DatabaseName='sales_db',
Targets={
'S3Targets': [
{
'Path': 's3://my-bucket/sales-data/'
}
]
},
SchemaChangePolicy={
'UpdateBehavior': 'UPDATE_IN_DATABASE',
'DeleteBehavior': 'LOG'
}
)
Glue ETL Jobs: Write transformations in PySpark or Python:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## Read from Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database="sales_db",
table_name="raw_sales"
)
## Transform data
applymapping1 = ApplyMapping.apply(
frame=datasource0,
mappings=[
("customer_id", "string", "customer_id", "int"),
("amount", "string", "amount", "decimal"),
("date", "string", "sale_date", "date")
]
)
## Write to S3 in Parquet format
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=applymapping1,
connection_type="s3",
connection_options={"path": "s3://output-bucket/processed/"},
format="parquet"
)
job.commit()
💡 Tip: Use Glue DataBrew for visual, no-code data preparation if you're not comfortable writing ETL scripts.
Amazon Athena: Serverless Query Service
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. You pay only for the queries you run; there is no infrastructure to manage.
Architecture:
+------------------------------------------+
|              AMAZON ATHENA               |
+------------------------------------------+
|                                          |
|  Query Editor                            |
|  +--------------------------------+      |
|  |  SELECT * FROM logs            |      |
|  |  WHERE status = 500            |      |
|  |    AND date = '2024-01-15'     |      |
|  +--------------------------------+      |
|                 |                        |
|                 v                        |
|  Presto Engine                           |
|  +--------------------------------+      |
|  |  Parse -> Optimize -> Execute  |      |
|  +--------------------------------+      |
|                 |                        |
|                 v                        |
|  Glue Data Catalog (table metadata)      |
|                 |                        |
|                 v                        |
|  Amazon S3                               |
|  (actual data: CSV, JSON, Parquet)       |
+------------------------------------------+
Creating Tables:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
date_time STRING,
edge_location STRING,
bytes_sent BIGINT,
request_ip STRING,
method STRING,
host STRING,
uri STRING,
status INT,
user_agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+(.+)$'
)
LOCATION 's3://my-bucket/cloudfront-logs/'
TBLPROPERTIES ('skip.header.line.count'='2');
Partitioning for Performance: Partitions dramatically reduce query costs by limiting data scanned:
CREATE EXTERNAL TABLE sales_partitioned (
transaction_id STRING,
amount DECIMAL(10,2),
product_id STRING
)
PARTITIONED BY (
year INT,
month INT,
day INT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';
-- Add partitions
ALTER TABLE sales_partitioned ADD PARTITION (year=2024, month=1, day=15)
LOCATION 's3://my-bucket/sales/year=2024/month=1/day=15/';
-- Query only specific partition
SELECT SUM(amount)
FROM sales_partitioned
WHERE year=2024 AND month=1 AND day=15;
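To run such a query programmatically, you submit it through the Athena API and poll for completion. A minimal sketch with boto3; the database name and results bucket are placeholder assumptions:
import time
import boto3

athena = boto3.client('athena')

query = """
SELECT SUM(amount)
FROM sales_partitioned
WHERE year = 2024 AND month = 1 AND day = 15;
"""

# Athena writes result files to S3, so an output location is required (placeholder bucket)
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'sales_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)
query_id = response['QueryExecutionId']

# Poll until the query finishes, then fetch the first page of results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    print(rows)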
Optimization Tips:
| Strategy | Impact | When to Use |
|---|---|---|
| Use columnar formats (Parquet, ORC) | 90% cost reduction | Always for analytics |
| Partition data | 80-95% cost reduction | Time-series or categorical data |
| Compress files | 30-70% cost reduction | Large datasets |
| Limit SELECT columns | Varies widely | Wide tables with many columns |
⚠️ Common Mistake: Storing data as uncompressed JSON or CSV. Because Athena bills per byte scanned, every query against a 10 GB raw JSON file pays for the full 10 GB, while the same data in compressed Parquet typically scans only a small fraction of that and costs roughly a tenth as much (or less) per query.
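One common fix is a one-off Athena CTAS statement that rewrites existing JSON/CSV data as partitioned, compressed Parquet. A sketch of the idea, submitted the same way as any other Athena query; the table, database, and bucket names are placeholder assumptions:
import boto3

athena = boto3.client('athena')

# CTAS: rewrite an existing raw table as compressed, partitioned Parquet
ctas_query = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/sales-parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT transaction_id, amount, product_id, year, month, day
FROM sales_raw;
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={'Database': 'sales_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)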
Amazon Kinesis: Real-Time Streaming
Amazon Kinesis enables real-time processing of streaming data at scale. It consists of four services:
+------------------------------------------------+
|             KINESIS SERVICE FAMILY             |
+------------------------------------------------+
|  Kinesis Data Streams                          |
|    Real-time data ingestion                    |
|    - Custom processing (Lambda, EC2)           |
|    - Replay capability                         |
|    - Retention from 24 hours up to 365 days    |
|                                                |
|  Kinesis Data Firehose                         |
|    Load streaming data into storage            |
|    - Auto-scaling                              |
|    - Transform with Lambda                     |
|    - Destinations: S3, Redshift, OpenSearch    |
|                                                |
|  Kinesis Data Analytics                        |
|    SQL on streaming data                       |
|    - Real-time analytics                       |
|    - Anomaly detection                         |
|                                                |
|  Kinesis Video Streams                         |
|    Video streaming and processing              |
|    - ML inference                              |
|    - Playback                                  |
+------------------------------------------------+
Kinesis Data Streams Architecture:
Producers              Kinesis Stream             Consumers
+------------+        +------------------+       +----------------+
| IoT        | -----> | Shard 1          | ----> | Lambda         |
| Devices    |   |    | [Records...]     |  |    | Function       |
+------------+   |    +------------------+  |    +----------------+
+------------+   |    +------------------+  |    +----------------+
| Web        | --+--> | Shard 2          | -+--> | EC2            |
| Servers    |   |    | [Records...]     |  |    | App            |
+------------+   |    +------------------+  |    +----------------+
+------------+   |    +------------------+  |    +----------------+
| Mobile     | --+--> | Shard N          | -+--> | Kinesis        |
| Apps       |        | [Records...]     |       | Analytics      |
+------------+        +------------------+       +----------------+
Key Concepts:
- Shard: Base throughput unit (1 MB/sec input, 2 MB/sec output)
- Partition Key: Determines which shard receives the record
- Sequence Number: Unique identifier for each record in a shard
Producer Example:
import boto3
import json
import time
kinesis = boto3.client('kinesis', region_name='us-east-1')
def send_event(stream_name, data, partition_key):
response = kinesis.put_record(
StreamName=stream_name,
Data=json.dumps(data),
PartitionKey=partition_key
)
return response['SequenceNumber']
## Send clickstream data
event_data = {
'user_id': 'user-12345',
'action': 'page_view',
'page': '/products/widget-pro',
'timestamp': int(time.time())
}
seq_num = send_event(
stream_name='clickstream',
data=event_data,
partition_key=event_data['user_id']
)
print(f"Sent record: {seq_num}")
Consumer Example (Lambda):
import json
import base64
def lambda_handler(event, context):
for record in event['Records']:
# Kinesis data is base64 encoded
payload = base64.b64decode(record['kinesis']['data'])
data = json.loads(payload)
print(f"Processing: {data}")
# Business logic here
if data['action'] == 'purchase':
process_purchase(data)
elif data['action'] == 'page_view':
track_page_view(data)
return {'statusCode': 200, 'body': 'Processed successfully'}
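For a Lambda consumer like this, the stream is attached through an event source mapping, which is what delivers batches of records to the handler. A minimal sketch with boto3; the function name, stream ARN, and batch settings are placeholder assumptions:
import boto3

lambda_client = boto3.client('lambda')

# Attach the Kinesis stream to the consumer function (placeholder names/ARNs)
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:123456789012:stream/clickstream',
    FunctionName='clickstream-consumer',
    StartingPosition='LATEST',
    BatchSize=100,
    MaximumBatchingWindowInSeconds=5
)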
💡 Tip: Use Kinesis Data Firehose when you need simple delivery to S3, Redshift, or Elasticsearch without custom processing. It's fully managed and auto-scales.
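Producing to Firehose is a single API call per record (or batch). A minimal sketch with boto3, assuming a delivery stream already exists; the stream name and event shape are placeholder assumptions:
import json
import boto3

firehose = boto3.client('firehose')

event = {'user_id': 'user-12345', 'action': 'page_view', 'page': '/pricing'}

# Firehose buffers records and delivers them to the configured destination
firehose.put_record(
    DeliveryStreamName='clickstream-to-s3',  # placeholder delivery stream
    Record={'Data': (json.dumps(event) + '\n').encode('utf-8')}
)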
AWS Lake Formation: Data Lake Management
AWS Lake Formation makes it easy to set up a secure data lake in days instead of months. It handles data ingestion, cataloging, transformation, and security.
Data Lake Architecture:
+--------------------------------------------------------+
|                   AWS LAKE FORMATION                   |
+--------------------------------------------------------+
|                                                        |
|  Centralized Security & Governance                     |
|    - Column-level permissions                          |
|    - Row-level security                                |
|    - Data access audit                                 |
|                          |                             |
|                          v                             |
|  Data Lake Storage (S3)                                |
|  +----------+----------+-----------+----------+        |
|  |   Raw    | Curated  | Analytics | Archive  |        |
|  |   Zone   |   Zone   |   Zone    |   Zone   |        |
|  +----------+----------+-----------+----------+        |
|                          |                             |
|                          v                             |
|  Data Transformation (Glue ETL)                        |
|    Raw -> Clean -> Aggregate -> Optimize               |
|                          |                             |
|                          v                             |
|  Analytics Services                                    |
|  +----------+----------+-----------+------------+      |
|  |  Athena  | Redshift |    EMR    | QuickSight |      |
|  |          | Spectrum |           |            |      |
|  +----------+----------+-----------+------------+      |
+--------------------------------------------------------+
Setting Up a Data Lake:
import boto3
lakeformation = boto3.client('lakeformation')
## Register S3 location as data lake
response = lakeformation.register_resource(
ResourceArn='arn:aws:s3:::my-data-lake',
UseServiceLinkedRole=True
)
## Grant permissions
lakeformation.grant_permissions(
Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/DataAnalyst'},
Resource={
'Table': {
'DatabaseName': 'sales_db',
'Name': 'transactions'
}
},
Permissions=['SELECT'],
PermissionsWithGrantOption=[]
)
## Column-level security
lakeformation.grant_permissions(
Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Marketing'},
Resource={
'TableWithColumns': {
'DatabaseName': 'sales_db',
'Name': 'customers',
'ColumnNames': ['customer_id', 'email', 'city']
# Excludes sensitive columns like SSN, credit_card
}
},
Permissions=['SELECT']
)
Data Lake Zones Pattern:
| Zone | Purpose | Format | Access |
|---|---|---|---|
| Raw/Landing | Original data from sources | Any (CSV, JSON, logs) | Data Engineers |
| Curated/Clean | Validated, cleaned data | Parquet, ORC | Data Scientists, Analysts |
| Analytics | Aggregated, business-ready | Parquet (partitioned) | Business Users, BI Tools |
| Archive | Historical data | Compressed Parquet | Compliance, Auditors |
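Zones are usually just S3 prefixes, so storage-class policies can reinforce the pattern. A hedged sketch that transitions objects under an archive/ prefix to Glacier after 90 days; the bucket name, prefix, and timing are illustrative assumptions, not a prescribed setup:
import boto3

s3 = boto3.client('s3')

# Transition archive-zone objects to a colder storage class (illustrative values)
s3.put_bucket_lifecycle_configuration(
    Bucket='my-data-lake',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-zone-to-glacier',
                'Filter': {'Prefix': 'archive/'},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 90, 'StorageClass': 'GLACIER'}
                ]
            }
        ]
    }
)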
Real-World Examples
Example 1: Building a Serverless Data Pipeline
Scenario: A retail company wants to analyze customer behavior from web clickstreams in real-time and generate daily reports.
Architecture:
+------------+      +----------------+      +-------------+
|  Website   | ---> |    Kinesis     | ---> |   Lambda    |
|  (Events)  |      | Data Firehose  |      | (Transform) |
+------------+      +----------------+      +------+------+
                                                   |
                       +---------------------------v---+
                       |         S3 Data Lake          |
                       |   /raw/clickstream/           |
                       |   /processed/clickstream/     |
                       +---------------+---------------+
                                       |
                       +---------------+---------------+
                       |                               |
              +----------------+              +----------------+
              |  Glue Crawler  |              |     Athena     |
              | (Catalog Data) |              |  (SQL Queries) |
              +----------------+              +--------+-------+
                                                       |
                                              +----------------+
                                              |   QuickSight   |
                                              |  (Dashboard)   |
                                              +----------------+
Implementation:
## Step 1: Configure Firehose delivery stream
import boto3
firehose = boto3.client('firehose')
firehose.create_delivery_stream(
DeliveryStreamName='clickstream-to-s3',
DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={  # extended config is needed for Lambda processing and custom prefixes
'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseRole',
'BucketARN': 'arn:aws:s3:::my-data-lake',
'Prefix': 'raw/clickstream/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/',
'ErrorOutputPrefix': 'errors/clickstream/',
'BufferingHints': {
'SizeInMBs': 5,
'IntervalInSeconds': 300
},
'CompressionFormat': 'GZIP',
'ProcessingConfiguration': {
'Enabled': True,
'Processors': [
{
'Type': 'Lambda',
'Parameters': [
{
'ParameterName': 'LambdaArn',
'ParameterValue': 'arn:aws:lambda:us-east-1:123456789012:function:EnrichClickstream'
}
]
}
]
}
}
)
## Step 2: Lambda transformation function (invoked by Firehose)
import json
import base64
from datetime import datetime

def lambda_handler(event, context):
output = []
for record in event['records']:
payload = json.loads(base64.b64decode(record['data']))
# Enrich with additional fields
payload['processed_at'] = datetime.utcnow().isoformat()
        payload['user_agent_parsed'] = parse_user_agent(payload.get('user_agent', ''))  # parse_user_agent: your own helper, not shown here
# Transform and encode
output_record = {
'recordId': record['recordId'],
'result': 'Ok',
'data': base64.b64encode(json.dumps(payload).encode('utf-8')).decode('utf-8')
}
output.append(output_record)
return {'records': output}
## Step 3: Query with Athena
athena_query = """
SELECT
date_trunc('hour', timestamp) as hour,
page,
COUNT(*) as page_views,
COUNT(DISTINCT user_id) as unique_visitors
FROM clickstream
WHERE year = 2024 AND month = 1 AND day = 15
GROUP BY 1, 2
ORDER BY page_views DESC;
"""
Cost Optimization:
- Athena bills per byte scanned (about $5 per TB in most regions), and compressed Parquet scans far fewer bytes than raw JSON
- Partitioning by date means a query touches only the relevant day's prefix
- With ~1 TB of new data per day (365 TB per year), a single-day query still scans only ~1 TB
Example 2: Data Warehouse Modernization
Scenario: Migrate from an on-premises Oracle data warehouse to Amazon Redshift.
Migration Strategy:
## Using AWS Database Migration Service (DMS)
import boto3
dms = boto3.client('dms')
## Create replication instance
dms.create_replication_instance(
ReplicationInstanceIdentifier='oracle-to-redshift',
ReplicationInstanceClass='dms.c5.xlarge',
AllocatedStorage=100,
EngineVersion='3.4.6'
)
## Create source endpoint (Oracle)
dms.create_endpoint(
EndpointIdentifier='oracle-source',
EndpointType='source',
EngineName='oracle',
ServerName='oracle.company.local',
Port=1521,
DatabaseName='PRODDB',
Username='dms_user',
Password='secure_password'
)
## Create target endpoint (Redshift)
dms.create_endpoint(
EndpointIdentifier='redshift-target',
EndpointType='target',
EngineName='redshift',
ServerName='my-redshift-cluster.abc123.us-east-1.redshift.amazonaws.com',
Port=5439,
DatabaseName='analytics',
Username='admin',
Password='redshift_password'
)
## Create migration task
dms.create_replication_task(
ReplicationTaskIdentifier='full-load-and-cdc',
SourceEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:oracle-source',
TargetEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:redshift-target',
ReplicationInstanceArn='arn:aws:dms:us-east-1:123456789012:rep:oracle-to-redshift',
MigrationType='full-load-and-cdc', # Full load + ongoing replication
TableMappings='''
{
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "migrate-sales-tables",
"object-locator": {
"schema-name": "SALES",
"table-name": "%"
},
"rule-action": "include"
}
]
}
'''
)
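Creating the task doesn't start it; once the task reaches the ready state you kick off the full load plus CDC explicitly. A minimal sketch; the task ARN is a placeholder assumption:
import boto3

dms = boto3.client('dms')

# Kick off the full load + CDC once the task shows as ready (placeholder ARN)
dms.start_replication_task(
    ReplicationTaskArn='arn:aws:dms:us-east-1:123456789012:task:full-load-and-cdc',
    StartReplicationTaskType='start-replication'
)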
Post-Migration Optimization:
-- Analyze table statistics for query planner
ANALYZE sales_fact;
-- Create materialized views for common queries
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE_TRUNC('day', sale_date) as day,
product_category,
SUM(amount) as total_sales,
COUNT(*) as transaction_count
FROM sales_fact
GROUP BY 1, 2;
-- Set up automatic refresh
CREATE MATERIALIZED VIEW mv_top_customers
AUTO REFRESH YES AS
SELECT
customer_id,
SUM(amount) as lifetime_value
FROM sales_fact
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 1000;
-- Vacuum and analyze regularly (in maintenance window)
VACUUM SORT ONLY sales_fact;
VACUUM DELETE ONLY sales_fact;
ANALYZE sales_fact;
Example 3: Real-Time Anomaly Detection
Scenario: Monitor IoT sensor data for anomalies using Kinesis Data Analytics.
SQL-based Stream Processing:
-- Create input stream from Kinesis Data Stream
CREATE OR REPLACE STREAM "SOURCE_SQL_STREAM" (
"sensor_id" VARCHAR(32),
"temperature" DOUBLE,
"pressure" DOUBLE,
"timestamp" TIMESTAMP
);
-- Detect anomalies using sliding window
CREATE OR REPLACE STREAM "ANOMALY_STREAM" (
"sensor_id" VARCHAR(32),
"current_temp" DOUBLE,
"avg_temp" DOUBLE,
"stddev_temp" DOUBLE,
"anomaly_score" DOUBLE,
"window_end" TIMESTAMP
);
CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
INSERT INTO "ANOMALY_STREAM"
SELECT STREAM
"sensor_id",
"temperature" as "current_temp",
AVG("temperature") OVER W as "avg_temp",
STDDEV("temperature") OVER W as "stddev_temp",
ABS("temperature" - AVG("temperature") OVER W) / STDDEV("temperature") OVER W as "anomaly_score",
STEP("timestamp" BY INTERVAL '1' MINUTE)
FROM "SOURCE_SQL_STREAM"
WINDOW W AS (
PARTITION BY "sensor_id"
RANGE INTERVAL '10' MINUTE PRECEDING
)
HAVING ABS("temperature" - AVG("temperature") OVER W) > 2 * STDDEV("temperature") OVER W;
-- Output to destination stream (triggers alerts via Lambda)
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"sensor_id" VARCHAR(32),
"alert_type" VARCHAR(16),
"severity" VARCHAR(8),
"message" VARCHAR(256)
);
CREATE OR REPLACE PUMP "ALERT_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
"sensor_id",
'TEMPERATURE_ANOMALY' as "alert_type",
CASE
WHEN "anomaly_score" > 4 THEN 'CRITICAL'
WHEN "anomaly_score" > 3 THEN 'HIGH'
ELSE 'MEDIUM'
END as "severity",
'Temperature deviation: ' || CAST("anomaly_score" AS VARCHAR) || ' std devs' as "message"
FROM "ANOMALY_STREAM";
Example 4: Multi-Account Data Lake
Scenario: Enterprise with multiple AWS accounts needs centralized data lake with granular access control.
Lake Formation Cross-Account Setup:
import boto3
## In the data lake account
lakeformation = boto3.client('lakeformation')
## Grant external account access to database
lakeformation.grant_permissions(
Principal={
'DataLakePrincipalIdentifier': 'arn:aws:iam::999999999999:root' # Consumer account
},
Resource={
'Database': {
'Name': 'shared_analytics_db'
}
},
Permissions=['DESCRIBE'],
PermissionsWithGrantOption=['DESCRIBE']
)
## Grant table-level permissions with column filtering
lakeformation.grant_permissions(
Principal={
'DataLakePrincipalIdentifier': 'arn:aws:iam::999999999999:role/DataScientist'
},
Resource={
'TableWithColumns': {
'DatabaseName': 'shared_analytics_db',
'Name': 'customer_transactions',
            # Specify either ColumnNames or ColumnWildcard, not both; here we
            # exclude the sensitive columns and expose everything else
            'ColumnWildcard': {
                'ExcludedColumnNames': ['customer_ssn', 'credit_card_number']
}
}
},
Permissions=['SELECT']
)
## Row-level security with data filters
lakeformation.create_data_cells_filter(
    TableData={
        'TableCatalogId': '123456789012',  # account that owns the Glue catalog
        'DatabaseName': 'shared_analytics_db',
        'TableName': 'customer_transactions',
        'Name': 'region_filter_us_only',
        'RowFilter': {
            'FilterExpression': "region = 'US'"
        },
        'ColumnWildcard': {}  # row filter applies to all columns
    }
)
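On the consumer side, the shared database is usually surfaced through a Glue resource link so Athena and Redshift Spectrum in that account can query it by a local name. A hedged sketch run in the consumer account; account IDs and names are placeholders, and the cross-account share itself is typically accepted via AWS RAM first:
import boto3

# Run in the consumer account (999999999999)
glue = boto3.client('glue')

# Create a resource link that points at the producer account's shared database
glue.create_database(
    DatabaseInput={
        'Name': 'shared_analytics_db_link',
        'TargetDatabase': {
            'CatalogId': '123456789012',        # producer / data lake account
            'DatabaseName': 'shared_analytics_db'
        }
    }
)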
Common Mistakes to Avoid ⚠️
1. Choosing the Wrong Service for the Use Case
❌ Wrong: Using Redshift for operational queries with high concurrency (thousands of QPS)
✅ Right: Use RDS or DynamoDB for operational workloads; Redshift for analytical queries
❌ Wrong: Loading data into Kinesis Data Streams when you just need S3 delivery
✅ Right: Use Kinesis Data Firehose for simple streaming to S3/Redshift
2. Poor Data Partitioning
-- ❌ BAD: No partitioning
CREATE EXTERNAL TABLE logs (
timestamp STRING,
message STRING,
level STRING
)
LOCATION 's3://bucket/logs/';
-- Every query scans ALL data!
-- ✅ GOOD: Partitioned by date
CREATE EXTERNAL TABLE logs (
timestamp STRING,
message STRING,
level STRING
)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 's3://bucket/logs/';
-- Query scans only relevant partitions
3. Ignoring Redshift Distribution Keys
-- ❌ BAD: No distribution strategy for large fact table
CREATE TABLE sales (
sale_id BIGINT,
product_id INT,
customer_id INT,
amount DECIMAL(10,2)
);
-- Massive data shuffling during joins!
-- ✅ GOOD: Distribute on the frequently joined column
CREATE TABLE sales (
sale_id BIGINT,
product_id INT,
customer_id INT,
amount DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (sale_id);
-- Co-locates related data, minimizes shuffling
4. Not Using Columnar Formats
| Format | Approx. Query Cost (10 GB dataset, $5/TB scanned) | Query Performance |
|---|---|---|
| JSON (uncompressed) | ~$0.05 (scans all 10 GB) | Slow (full scan) |
| CSV (gzip) | ~$0.01-0.02 (scans compressed bytes) | Medium (compressed scan) |
| Parquet (snappy) | <$0.005 (columnar, predicate pushdown) | Fast |
5. Over-Sharding Kinesis Streams
💡 Remember: Each shard costs ~$0.015/hour ($10.80/month)
❌ Wrong: Provisioning 100 shards for 10 MB/sec throughput
✅ Right: 10 shards handle 10 MB/sec of ingest (1 MB/sec per shard)
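Shard counts fall straight out of the per-shard write limits (1 MB/sec or 1,000 records/sec in, 2 MB/sec out). A small helper to sanity-check sizing before provisioning:
import math

def required_shards(ingest_mb_per_sec: float, records_per_sec: float) -> int:
    """Return the minimum shard count for the given ingest rate.

    Each shard accepts up to 1 MB/sec or 1,000 records/sec of writes.
    """
    by_throughput = math.ceil(ingest_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    return max(by_throughput, by_records, 1)

# 10 MB/sec of small records -> 10 shards, not 100
print(required_shards(ingest_mb_per_sec=10, records_per_sec=5000))  # 10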
6. Not Monitoring Glue Job Metrics
## ✅ GOOD: Enable job metrics and bookmarks
import boto3

glue = boto3.client('glue')
glue.create_job(
Name='sales-etl',
Role='AWSGlueServiceRole',
Command={
'Name': 'glueetl',
'ScriptLocation': 's3://bucket/scripts/etl.py'
},
DefaultArguments={
'--enable-metrics': 'true',
'--enable-continuous-cloudwatch-log': 'true',
'--enable-job-insights': 'true',
'--job-bookmark-option': 'job-bookmark-enable',
'--TempDir': 's3://bucket/temp/',
'--enable-glue-datacatalog': 'true'
},
MaxRetries=1,
Timeout=120, # minutes
MaxCapacity=10 # DPUs
)
7. Missing Data Validation
## ✅ GOOD: Add data quality checks in Glue
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
def validate_data(dynamic_frame):
df = dynamic_frame.toDF()
# Check for nulls in critical columns
null_counts = df.select(
[F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df.columns]
).collect()[0].asDict()
for col, null_count in null_counts.items():
if col in ['customer_id', 'amount', 'transaction_date'] and null_count > 0:
raise ValueError(f"Critical column {col} has {null_count} null values")
# Check for duplicates
total_count = df.count()
distinct_count = df.select('transaction_id').distinct().count()
if total_count != distinct_count:
raise ValueError(f"Found {total_count - distinct_count} duplicate transactions")
return dynamic_frame
## Use in ETL job
cleaned_data = validate_data(raw_data)
Key Takeaways
Quick Reference Card
| Service | Use When | Key Feature |
|---|---|---|
| Redshift | Complex SQL analytics on structured data | Columnar storage, petabyte-scale |
| Athena | Ad-hoc SQL queries on S3 data | Serverless, pay-per-query |
| Glue | ETL transformations, data cataloging | Serverless Spark, auto-scaling |
| Kinesis Streams | Real-time data processing with replay | Custom consumers, retention up to 365 days |
| Kinesis Firehose | Simple streaming to S3/Redshift/ES | Auto-scaling, zero administration |
| Lake Formation | Centralized data lake with governance | Fine-grained access control |
Decision Tree:
Need to analyze data?
|
+-- Real-time? ------------------> Kinesis Data Streams / Analytics
|
+-- Interactive SQL?
|     |
|     +-- On S3? ----------------> Athena
|     +-- Data warehouse? -------> Redshift
|
+-- Data preparation? -----------> Glue ETL
Cost Optimization Checklist:
- ✅ Use Parquet/ORC with compression
- ✅ Partition data by query patterns
- ✅ Limit SELECT columns in Athena
- ✅ Right-size Redshift clusters (pause when idle)
- ✅ Use Glue job bookmarks to avoid reprocessing
- ✅ Monitor Kinesis shard utilization
- ✅ Use S3 Intelligent-Tiering for data lakes
Security Best Practices:
- ✅ Enable encryption at rest (S3, Redshift, Kinesis)
- ✅ Use VPC endpoints for private connectivity
- ✅ Implement least-privilege IAM policies
- ✅ Use Lake Formation for centralized access control
- ✅ Enable CloudTrail for audit logging
- ✅ Rotate credentials regularly
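Most of these controls are one-time API calls. A hedged sketch enabling default encryption on a lake bucket and KMS encryption on a Kinesis stream; the bucket, stream, and key names are placeholder assumptions:
import boto3

s3 = boto3.client('s3')
kinesis = boto3.client('kinesis')

# Default server-side encryption for the data lake bucket
s3.put_bucket_encryption(
    Bucket='my-data-lake',
    ServerSideEncryptionConfiguration={
        'Rules': [
            {'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'aws:kms'}}
        ]
    }
)

# Server-side encryption for records flowing through the clickstream
kinesis.start_stream_encryption(
    StreamName='clickstream',
    EncryptionType='KMS',
    KeyId='alias/aws/kinesis'
)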
Further Study
- AWS Documentation - Data Analytics: https://docs.aws.amazon.com/analytics/
- AWS Big Data Blog: https://aws.amazon.com/blogs/big-data/
- Redshift Performance Tuning: https://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Memory Device: Remember the data journey as CGRAK:
- Collect (Kinesis)
- Glue (Transform)
- Redshift (Warehouse)
- Athena (Query)
- Knowledge (Insights)
Master these services and you'll be architecting enterprise-scale data solutions with confidence!