Data & Analytics
Data lakes, streaming analytics, Glue, Athena, and data architecture patterns
AWS Data & Analytics Services
Master AWS data and analytics with free flashcards and hands-on practice. This lesson covers Amazon Redshift, AWS Glue, Amazon Athena, Amazon Kinesis, and AWS Lake Formation, the essential services for building scalable data pipelines and analytics solutions on AWS.
Welcome to AWS Data & Analytics
The AWS data and analytics ecosystem offers a comprehensive suite of services for ingesting, storing, processing, and analyzing data at any scale. Whether you're building a real-time streaming application, a massive data warehouse, or a serverless query engine, AWS provides purpose-built tools that integrate seamlessly.
In this lesson, you'll learn how to architect modern data solutions using AWS managed services, understand when to use each tool, and explore real-world implementation patterns that power today's data-driven applications.
Core Concepts
Amazon Redshift: Cloud Data Warehouse
Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for Online Analytical Processing (OLAP) workloads. Unlike transactional databases, Redshift is optimized for complex queries across massive datasets.
Key Architecture Components:
+-------------------------------------------+
|             REDSHIFT CLUSTER              |
+-------------------------------------------+
|                                           |
|  +-----------+       +-----------------+  |
|  |  Leader   |<----->| Compute Node 1  |  |
|  |  Node     |       +-----------------+  |
|  |           |       +-----------------+  |
|  |  (Query   |<----->| Compute Node 2  |  |
|  |  Planner) |       +-----------------+  |
|  |           |       +-----------------+  |
|  |           |<----->| Compute Node N  |  |
|  +-----------+       +-----------------+  |
|                                           |
+-------------------------------------------+
      ^                        |
      |                        v
  SQL Queries               Results
Columnar Storage: Redshift stores data in columns rather than rows, which dramatically improves query performance for analytical workloads:
| Row Storage (OLTP) | Column Storage (OLAP) |
|---|---|
| ID, Name, Age, City | ID: 1, 2, 3, 4... |
| 1, Alice, 30, NYC | Name: Alice, Bob, Carol... |
| 2, Bob, 25, LA | Age: 30, 25, 35... |
| 3, Carol, 35, SF | City: NYC, LA, SF... |
💡 Tip: When querying "SELECT AVG(Age) FROM users", columnar storage only reads the Age column, skipping all other data.
Distribution Styles: Redshift distributes data across compute nodes using four strategies:
- KEY: Distributes rows based on a specific column value (co-locates related data for joins)
- ALL: Replicates the entire table to every node (best for small dimension tables)
- EVEN: Round-robin distribution (useful for tables without an obvious join column)
- AUTO: Lets Redshift choose and adjust the style based on table size (the default)
CREATE TABLE sales (
sale_id INT,
product_id INT,
amount DECIMAL(10,2),
sale_date DATE
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (sale_date);
Redshift Spectrum: Query data directly in S3 without loading it into Redshift, which is perfect for infrequently accessed data or data lake integration.
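As a rough sketch of how Spectrum is typically wired up, you create an external schema that points at the Glue Data Catalog and then query S3-backed tables alongside local ones. Here the DDL and query are submitted through the boto3 Redshift Data API; the schema, IAM role, cluster, and table names are placeholder assumptions:
import boto3

redshift_data = boto3.client('redshift-data')

# Map an external schema onto a Glue Data Catalog database (placeholder names)
create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_sales
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
"""

# Query S3-resident data through Spectrum, joined with a local Redshift table
spectrum_query = """
SELECT s.product_id, SUM(s.amount) AS total_sales
FROM spectrum_sales.raw_sales s
JOIN products p ON p.product_id = s.product_id
GROUP BY s.product_id;
"""

for sql in (create_schema_sql, spectrum_query):
    redshift_data.execute_statement(
        ClusterIdentifier='my-redshift-cluster',  # placeholder cluster
        Database='analytics',
        DbUser='admin',
        Sql=sql
    )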
AWS Glue: Serverless ETL
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare data for analytics. It's serverless: you don't provision or manage infrastructure.
Key Components:
+------------------------------------------------+
|             AWS GLUE ARCHITECTURE              |
+------------------------------------------------+
|                                                |
|  Data Catalog (metadata store)                 |
|  +------------------------------------+        |
|  |  - Databases                       |        |
|  |  - Tables                          |        |
|  |  - Schemas                         |        |
|  +------------------------------------+        |
|       ^                     |                  |
|       |                     v                  |
|  Crawlers                ETL Jobs              |
|  +---------------+       +------------------+  |
|  | Scan S3       |       | PySpark / Python |  |
|  | Scan RDS      |       | Transform data   |  |
|  | Infer schema  |       | Write output     |  |
|  +---------------+       +------------------+  |
|       ^                     |                  |
|       |                     v                  |
|  Data Sources            Data Targets          |
|  (S3, RDS, etc.)         (S3, Redshift, etc.)  |
+------------------------------------------------+
Glue Data Catalog: A centralized metadata repository that integrates with:
- Amazon Athena
- Amazon Redshift Spectrum
- Amazon EMR
- AWS Lake Formation
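Because the catalog is just an API-accessible metadata store, these services (or your own code) can all look up the same table definitions. A minimal sketch using boto3, assuming the sales_db database created in the examples below:
import boto3

glue = boto3.client('glue')

# List the tables the catalog knows about in a database (assumed name)
tables = glue.get_tables(DatabaseName='sales_db')
for table in tables['TableList']:
    columns = table['StorageDescriptor']['Columns']
    print(table['Name'], [c['Name'] for c in columns])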
Glue Crawlers: Automatically scan data sources and populate the Data Catalog:
import boto3
glue = boto3.client('glue')
response = glue.create_crawler(
Name='s3-sales-crawler',
Role='AWSGlueServiceRole-CrawlerAccess',
DatabaseName='sales_db',
Targets={
'S3Targets': [
{
'Path': 's3://my-bucket/sales-data/'
}
]
},
SchemaChangePolicy={
'UpdateBehavior': 'UPDATE_IN_DATABASE',
'DeleteBehavior': 'LOG'
}
)
Glue ETL Jobs: Write transformations in PySpark or Python:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## Read from Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database="sales_db",
table_name="raw_sales"
)
## Transform data
applymapping1 = ApplyMapping.apply(
frame=datasource0,
mappings=[
("customer_id", "string", "customer_id", "int"),
("amount", "string", "amount", "decimal"),
("date", "string", "sale_date", "date")
]
)
## Write to S3 in Parquet format
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=applymapping1,
connection_type="s3",
connection_options={"path": "s3://output-bucket/processed/"},
format="parquet"
)
job.commit()
💡 Tip: Use Glue DataBrew for visual, no-code data preparation if you're not comfortable writing ETL scripts.
Amazon Athena: Serverless Query Service
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. You pay only for the queries you run; there is no infrastructure to manage.
Architecture:
+------------------------------------------+
|              AMAZON ATHENA               |
+------------------------------------------+
|                                          |
|  Query Editor                            |
|  +--------------------------------+      |
|  |  SELECT * FROM logs            |      |
|  |  WHERE status = 500            |      |
|  |    AND date = '2024-01-15'     |      |
|  +--------------------------------+      |
|                 |                        |
|                 v                        |
|  Presto Engine                           |
|  +--------------------------------+      |
|  |  Parse -> Optimize -> Execute  |      |
|  +--------------------------------+      |
|                 |                        |
|                 v                        |
|  Glue Data Catalog (table metadata)      |
|                 |                        |
|                 v                        |
|  Amazon S3                               |
|  (actual data: CSV, JSON, Parquet)       |
+------------------------------------------+
Creating Tables:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
date_time STRING,
edge_location STRING,
bytes_sent BIGINT,
request_ip STRING,
method STRING,
host STRING,
uri STRING,
status INT,
user_agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+(.+)$'
)
LOCATION 's3://my-bucket/cloudfront-logs/'
TBLPROPERTIES ('skip.header.line.count'='2');
Partitioning for Performance: Partitions dramatically reduce query costs by limiting data scanned:
CREATE EXTERNAL TABLE sales_partitioned (
transaction_id STRING,
amount DECIMAL(10,2),
product_id STRING
)
PARTITIONED BY (
year INT,
month INT,
day INT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sales/';
-- Add partitions
ALTER TABLE sales_partitioned ADD PARTITION (year=2024, month=1, day=15)
LOCATION 's3://my-bucket/sales/year=2024/month=1/day=15/';
-- Query only specific partition
SELECT SUM(amount)
FROM sales_partitioned
WHERE year=2024 AND month=1 AND day=15;
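To run such a query programmatically, you submit it through the Athena API and poll for completion. A minimal sketch with boto3; the database name and results bucket are placeholder assumptions:
import time
import boto3

athena = boto3.client('athena')

query = """
SELECT SUM(amount)
FROM sales_partitioned
WHERE year = 2024 AND month = 1 AND day = 15;
"""

# Athena writes result files to S3, so an output location is required (placeholder bucket)
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'sales_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)
query_id = response['QueryExecutionId']

# Poll until the query finishes, then fetch the first page of results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    print(rows)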
Optimization Tips:
| Strategy | Impact | When to Use |
|---|---|---|
| Use columnar formats (Parquet, ORC) | 90% cost reduction | Always for analytics |
| Partition data | 80-95% cost reduction | Time-series or categorical data |
| Compress files | 30-70% cost reduction | Large datasets |
| Limit SELECT columns | Varies widely | Wide tables with many columns |
⚠️ Common Mistake: Storing data as uncompressed JSON or CSV. Because Athena bills per byte scanned, every query against a 10 GB raw JSON file pays for the full 10 GB, while the same data in compressed Parquet typically scans only a small fraction of that and costs roughly a tenth as much (or less) per query.
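One common fix is a one-off Athena CTAS statement that rewrites existing JSON/CSV data as partitioned, compressed Parquet. A sketch of the idea, submitted the same way as any other Athena query; the table, database, and bucket names are placeholder assumptions:
import boto3

athena = boto3.client('athena')

# CTAS: rewrite an existing raw table as compressed, partitioned Parquet
ctas_query = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/sales-parquet/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT transaction_id, amount, product_id, year, month, day
FROM sales_raw;
"""

athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={'Database': 'sales_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)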
Amazon Kinesis: Real-Time Streaming
Amazon Kinesis enables real-time processing of streaming data at scale. It consists of four services:
+------------------------------------------------+
|             KINESIS SERVICE FAMILY             |
+------------------------------------------------+
|  Kinesis Data Streams                          |
|    Real-time data ingestion                    |
|    - Custom processing (Lambda, EC2)           |
|    - Replay capability                         |
|    - Retention from 24 hours up to 365 days    |
|                                                |
|  Kinesis Data Firehose                         |
|    Load streaming data into storage            |
|    - Auto-scaling                              |
|    - Transform with Lambda                     |
|    - Destinations: S3, Redshift, OpenSearch    |
|                                                |
|  Kinesis Data Analytics                        |
|    SQL on streaming data                       |
|    - Real-time analytics                       |
|    - Anomaly detection                         |
|                                                |
|  Kinesis Video Streams                         |
|    Video streaming and processing              |
|    - ML inference                              |
|    - Playback                                  |
+------------------------------------------------+
Kinesis Data Streams Architecture:
Producers              Kinesis Stream             Consumers
+------------+        +------------------+       +----------------+
| IoT        | -----> | Shard 1          | ----> | Lambda         |
| Devices    |   |    | [Records...]     |  |    | Function       |
+------------+   |    +------------------+  |    +----------------+
+------------+   |    +------------------+  |    +----------------+
| Web        | --+--> | Shard 2          | -+--> | EC2            |
| Servers    |   |    | [Records...]     |  |    | App            |
+------------+   |    +------------------+  |    +----------------+
+------------+   |    +------------------+  |    +----------------+
| Mobile     | --+--> | Shard N          | -+--> | Kinesis        |
| Apps       |        | [Records...]     |       | Analytics      |
+------------+        +------------------+       +----------------+
Key Concepts:
- Shard: Base throughput unit (1 MB/sec input, 2 MB/sec output)
- Partition Key: Determines which shard receives the record
- Sequence Number: Unique identifier for each record in a shard
Producer Example:
import boto3
import json
import time
kinesis = boto3.client('kinesis', region_name='us-east-1')
def send_event(stream_name, data, partition_key):
response = kinesis.put_record(
StreamName=stream_name,
Data=json.dumps(data),
PartitionKey=partition_key
)
return response['SequenceNumber']
## Send clickstream data
event_data = {
'user_id': 'user-12345',
'action': 'page_view',
'page': '/products/widget-pro',
'timestamp': int(time.time())
}
seq_num = send_event(
stream_name='clickstream',
data=event_data,
partition_key=event_data['user_id']
)
print(f"Sent record: {seq_num}")
Consumer Example (Lambda):
import json
import base64
def lambda_handler(event, context):
for record in event['Records']:
# Kinesis data is base64 encoded
payload = base64.b64decode(record['kinesis']['data'])
data = json.loads(payload)
print(f"Processing: {data}")
# Business logic here
if data['action'] == 'purchase':
process_purchase(data)
elif data['action'] == 'page_view':
track_page_view(data)
return {'statusCode': 200, 'body': 'Processed successfully'}
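For a Lambda consumer like this, the stream is attached through an event source mapping, which is what delivers batches of records to the handler. A minimal sketch with boto3; the function name, stream ARN, and batch settings are placeholder assumptions:
import boto3

lambda_client = boto3.client('lambda')

# Attach the Kinesis stream to the consumer function (placeholder names/ARNs)
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:123456789012:stream/clickstream',
    FunctionName='clickstream-consumer',
    StartingPosition='LATEST',
    BatchSize=100,
    MaximumBatchingWindowInSeconds=5
)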
💡 Tip: Use Kinesis Data Firehose when you need simple delivery to S3, Redshift, or Elasticsearch without custom processing. It's fully managed and auto-scales.
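Producing to Firehose is a single API call per record (or batch). A minimal sketch with boto3, assuming a delivery stream already exists; the stream name and event shape are placeholder assumptions:
import json
import boto3

firehose = boto3.client('firehose')

event = {'user_id': 'user-12345', 'action': 'page_view', 'page': '/pricing'}

# Firehose buffers records and delivers them to the configured destination
firehose.put_record(
    DeliveryStreamName='clickstream-to-s3',  # placeholder delivery stream
    Record={'Data': (json.dumps(event) + '\n').encode('utf-8')}
)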
AWS Lake Formation: Data Lake Management
AWS Lake Formation makes it easy to set up a secure data lake in days instead of months. It handles data ingestion, cataloging, transformation, and security.
Data Lake Architecture:
+--------------------------------------------------------+
|                   AWS LAKE FORMATION                   |
+--------------------------------------------------------+
|                                                        |
|  Centralized Security & Governance                     |
|    - Column-level permissions                          |
|    - Row-level security                                |
|    - Data access audit                                 |
|                          |                             |
|                          v                             |
|  Data Lake Storage (S3)                                |
|  +----------+----------+-----------+----------+        |
|  |   Raw    | Curated  | Analytics | Archive  |        |
|  |   Zone   |   Zone   |   Zone    |   Zone   |        |
|  +----------+----------+-----------+----------+        |
|                          |                             |
|                          v                             |
|  Data Transformation (Glue ETL)                        |
|    Raw -> Clean -> Aggregate -> Optimize               |
|                          |                             |
|                          v                             |
|  Analytics Services                                    |
|  +----------+----------+-----------+------------+      |
|  |  Athena  | Redshift |    EMR    | QuickSight |      |
|  |          | Spectrum |           |            |      |
|  +----------+----------+-----------+------------+      |
+--------------------------------------------------------+
Setting Up a Data Lake:
import boto3
lakeformation = boto3.client('lakeformation')
## Register S3 location as data lake
response = lakeformation.register_resource(
ResourceArn='arn:aws:s3:::my-data-lake',
UseServiceLinkedRole=True
)
## Grant permissions
lakeformation.grant_permissions(
Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/DataAnalyst'},
Resource={
'Table': {
'DatabaseName': 'sales_db',
'Name': 'transactions'
}
},
Permissions=['SELECT'],
PermissionsWithGrantOption=[]
)
## Column-level security
lakeformation.grant_permissions(
Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:role/Marketing'},
Resource={
'TableWithColumns': {
'DatabaseName': 'sales_db',
'Name': 'customers',
'ColumnNames': ['customer_id', 'email', 'city']
# Excludes sensitive columns like SSN, credit_card
}
},
Permissions=['SELECT']
)
Data Lake Zones Pattern:
| Zone | Purpose | Format | Access |
|---|---|---|---|
| Raw/Landing | Original data from sources | Any (CSV, JSON, logs) | Data Engineers |
| Curated/Clean | Validated, cleaned data | Parquet, ORC | Data Scientists, Analysts |
| Analytics | Aggregated, business-ready | Parquet (partitioned) | Business Users, BI Tools |
| Archive | Historical data | Compressed Parquet | Compliance, Auditors |
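Zones are usually just S3 prefixes, so storage-class policies can reinforce the pattern. A hedged sketch that transitions objects under an archive/ prefix to Glacier after 90 days; the bucket name, prefix, and timing are illustrative assumptions, not a prescribed setup:
import boto3

s3 = boto3.client('s3')

# Transition archive-zone objects to a colder storage class (illustrative values)
s3.put_bucket_lifecycle_configuration(
    Bucket='my-data-lake',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-zone-to-glacier',
                'Filter': {'Prefix': 'archive/'},
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 90, 'StorageClass': 'GLACIER'}
                ]
            }
        ]
    }
)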
Real-World Examples
Example 1: Building a Serverless Data Pipeline
Scenario: A retail company wants to analyze customer behavior from web clickstreams in real-time and generate daily reports.
Architecture:
+------------+      +----------------+      +-------------+
|  Website   | ---> |    Kinesis     | ---> |   Lambda    |
|  (Events)  |      | Data Firehose  |      | (Transform) |
+------------+      +----------------+      +------+------+
                                                   |
                       +---------------------------v---+
                       |         S3 Data Lake          |
                       |   /raw/clickstream/           |
                       |   /processed/clickstream/     |
                       +---------------+---------------+
                                       |
                       +---------------+---------------+
                       |                               |
              +----------------+              +----------------+
              |  Glue Crawler  |              |     Athena     |
              | (Catalog Data) |              |  (SQL Queries) |
              +----------------+              +--------+-------+
                                                       |
                                              +----------------+
                                              |   QuickSight   |
                                              |  (Dashboard)   |
                                              +----------------+
Implementation:
## Step 1: Configure Firehose delivery stream
import boto3
firehose = boto3.client('firehose')
firehose.create_delivery_stream(
DeliveryStreamName='clickstream-to-s3',
DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={  # extended config is needed for Lambda processing and custom prefixes
'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseRole',
'BucketARN': 'arn:aws:s3:::my-data-lake',
'Prefix': 'raw/clickstream/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/',
'ErrorOutputPrefix': 'errors/clickstream/',
'BufferingHints': {
'SizeInMBs': 5,
'IntervalInSeconds': 300
},
'CompressionFormat': 'GZIP',
'ProcessingConfiguration': {
'Enabled': True,
'Processors': [
{
'Type': 'Lambda',
'Parameters': [
{
'ParameterName': 'LambdaArn',
'ParameterValue': 'arn:aws:lambda:us-east-1:123456789012:function:EnrichClickstream'
}
]
}
]
}
}
)
## Step 2: Lambda transformation function (invoked by Firehose)
import json
import base64
from datetime import datetime

def lambda_handler(event, context):
output = []
for record in event['records']:
payload = json.loads(base64.b64decode(record['data']))
# Enrich with additional fields
payload['processed_at'] = datetime.utcnow().isoformat()
        payload['user_agent_parsed'] = parse_user_agent(payload.get('user_agent', ''))  # parse_user_agent: your own helper, not shown here
# Transform and encode
output_record = {
'recordId': record['recordId'],
'result': 'Ok',
'data': base64.b64encode(json.dumps(payload).encode('utf-8')).decode('utf-8')
}
output.append(output_record)
return {'records': output}
## Step 3: Query with Athena
athena_query = """
SELECT
date_trunc('hour', timestamp) as hour,
page,
COUNT(*) as page_views,
COUNT(DISTINCT user_id) as unique_visitors
FROM clickstream
WHERE year = 2024 AND month = 1 AND day = 15
GROUP BY 1, 2
ORDER BY page_views DESC;
"""
Cost Optimization:
- Athena bills per byte scanned (about $5 per TB in most regions), and compressed Parquet scans far fewer bytes than raw JSON
- Partitioning by date means a query touches only the relevant day's prefix
- With ~1 TB of new data per day (365 TB per year), a single-day query still scans only ~1 TB
Example 2: Data Warehouse Modernization
Scenario: Migrate from an on-premises Oracle data warehouse to Amazon Redshift.
Migration Strategy:
## Using AWS Database Migration Service (DMS)
import boto3
dms = boto3.client('dms')
## Create replication instance
dms.create_replication_instance(
ReplicationInstanceIdentifier='oracle-to-redshift',
ReplicationInstanceClass='dms.c5.xlarge',
AllocatedStorage=100,
EngineVersion='3.4.6'
)
## Create source endpoint (Oracle)
dms.create_endpoint(
EndpointIdentifier='oracle-source',
EndpointType='source',
EngineName='oracle',
ServerName='oracle.company.local',
Port=1521,
DatabaseName='PRODDB',
Username='dms_user',
Password='secure_password'
)
## Create target endpoint (Redshift)
dms.create_endpoint(
EndpointIdentifier='redshift-target',
EndpointType='target',
EngineName='redshift',
ServerName='my-redshift-cluster.abc123.us-east-1.redshift.amazonaws.com',
Port=5439,
DatabaseName='analytics',
Username='admin',
Password='redshift_password'
)
## Create migration task
dms.create_replication_task(
ReplicationTaskIdentifier='full-load-and-cdc',
SourceEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:oracle-source',
TargetEndpointArn='arn:aws:dms:us-east-1:123456789012:endpoint:redshift-target',
ReplicationInstanceArn='arn:aws:dms:us-east-1:123456789012:rep:oracle-to-redshift',
MigrationType='full-load-and-cdc', # Full load + ongoing replication
TableMappings='''
{
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "migrate-sales-tables",
"object-locator": {
"schema-name": "SALES",
"table-name": "%"
},
"rule-action": "include"
}
]
}
'''
)
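Creating the task doesn't start it; once the task reaches the ready state you kick off the full load plus CDC explicitly. A minimal sketch; the task ARN is a placeholder assumption:
import boto3

dms = boto3.client('dms')

# Kick off the full load + CDC once the task shows as ready (placeholder ARN)
dms.start_replication_task(
    ReplicationTaskArn='arn:aws:dms:us-east-1:123456789012:task:full-load-and-cdc',
    StartReplicationTaskType='start-replication'
)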
Post-Migration Optimization:
-- Analyze table statistics for query planner
ANALYZE sales_fact;
-- Create materialized views for common queries
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE_TRUNC('day', sale_date) as day,
product_category,
SUM(amount) as total_sales,
COUNT(*) as transaction_count
FROM sales_fact
GROUP BY 1, 2;
-- Set up automatic refresh
CREATE MATERIALIZED VIEW mv_top_customers
AUTO REFRESH YES AS
SELECT
customer_id,
SUM(amount) as lifetime_value
FROM sales_fact
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 1000;
-- Vacuum and analyze regularly (in maintenance window)
VACUUM SORT ONLY sales_fact;
VACUUM DELETE ONLY sales_fact;
ANALYZE sales_fact;
Example 3: Real-Time Anomaly Detection
Scenario: Monitor IoT sensor data for anomalies using Kinesis Data Analytics.
SQL-based Stream Processing:
-- Create input stream from Kinesis Data Stream
CREATE OR REPLACE STREAM "SOURCE_SQL_STREAM" (
"sensor_id" VARCHAR(32),
"temperature" DOUBLE,
"pressure" DOUBLE,
"timestamp" TIMESTAMP
);
-- Detect anomalies using sliding window
CREATE OR REPLACE STREAM "ANOMALY_STREAM" (
"sensor_id" VARCHAR(32),
"current_temp" DOUBLE,
"avg_temp" DOUBLE,
"stddev_temp" DOUBLE,
"anomaly_score" DOUBLE,
"window_end" TIMESTAMP
);
CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
INSERT INTO "ANOMALY_STREAM"
SELECT STREAM
"sensor_id",
"temperature" as "current_temp",
AVG("temperature") OVER W as "avg_temp",
STDDEV("temperature") OVER W as "stddev_temp",
ABS("temperature" - AVG("temperature") OVER W) / STDDEV("temperature") OVER W as "anomaly_score",
STEP("timestamp" BY INTERVAL '1' MINUTE)
FROM "SOURCE_SQL_STREAM"
WINDOW W AS (
PARTITION BY "sensor_id"
RANGE INTERVAL '10' MINUTE PRECEDING
)
HAVING ABS("temperature" - AVG("temperature") OVER W) > 2 * STDDEV("temperature") OVER W;
-- Output to destination stream (triggers alerts via Lambda)
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"sensor_id" VARCHAR(32),
"alert_type" VARCHAR(16),
"severity" VARCHAR(8),
"message" VARCHAR(256)
);
CREATE OR REPLACE PUMP "ALERT_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
"sensor_id",
'TEMPERATURE_ANOMALY' as "alert_type",
CASE
WHEN "anomaly_score" > 4 THEN 'CRITICAL'
WHEN "anomaly_score" > 3 THEN 'HIGH'
ELSE 'MEDIUM'
END as "severity",
'Temperature deviation: ' || CAST("anomaly_score" AS VARCHAR) || ' std devs' as "message"
FROM "ANOMALY_STREAM";
Example 4: Multi-Account Data Lake
Scenario: Enterprise with multiple AWS accounts needs centralized data lake with granular access control.
Lake Formation Cross-Account Setup:
import boto3
## In the data lake account
lakeformation = boto3.client('lakeformation')
## Grant external account access to database
lakeformation.grant_permissions(
Principal={
'DataLakePrincipalIdentifier': 'arn:aws:iam::999999999999:root' # Consumer account
},
Resource={
'Database': {
'Name': 'shared_analytics_db'
}
},
Permissions=['DESCRIBE'],
PermissionsWithGrantOption=['DESCRIBE']
)
## Grant table-level permissions with column filtering
lakeformation.grant_permissions(
Principal={
'DataLakePrincipalIdentifier': 'arn:aws:iam::999999999999:role/DataScientist'
},
Resource={
'TableWithColumns': {
'DatabaseName': 'shared_analytics_db',
'Name': 'customer_transactions',
            # Specify either ColumnNames or ColumnWildcard, not both; here we
            # exclude the sensitive columns and expose everything else
            'ColumnWildcard': {
                'ExcludedColumnNames': ['customer_ssn', 'credit_card_number']
}
}
},
Permissions=['SELECT']
)
## Row-level security with data filters
lakeformation.create_data_cells_filter(
    TableData={
        'TableCatalogId': '123456789012',  # account that owns the Glue catalog
        'DatabaseName': 'shared_analytics_db',
        'TableName': 'customer_transactions',
        'Name': 'region_filter_us_only',
        'RowFilter': {
            'FilterExpression': "region = 'US'"
        },
        'ColumnWildcard': {}  # row filter applies to all columns
    }
)
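On the consumer side, the shared database is usually surfaced through a Glue resource link so Athena and Redshift Spectrum in that account can query it by a local name. A hedged sketch run in the consumer account; account IDs and names are placeholders, and the cross-account share itself is typically accepted via AWS RAM first:
import boto3

# Run in the consumer account (999999999999)
glue = boto3.client('glue')

# Create a resource link that points at the producer account's shared database
glue.create_database(
    DatabaseInput={
        'Name': 'shared_analytics_db_link',
        'TargetDatabase': {
            'CatalogId': '123456789012',        # producer / data lake account
            'DatabaseName': 'shared_analytics_db'
        }
    }
)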
Common Mistakes to Avoid ⚠️
1. Choosing the Wrong Service for the Use Case
❌ Wrong: Using Redshift for operational queries with high concurrency (thousands of QPS)
✅ Right: Use RDS or DynamoDB for operational workloads; Redshift for analytical queries
❌ Wrong: Loading data into Kinesis Data Streams when you just need S3 delivery
✅ Right: Use Kinesis Data Firehose for simple streaming to S3/Redshift
2. Poor Data Partitioning
-- ❌ BAD: No partitioning
CREATE EXTERNAL TABLE logs (
timestamp STRING,
message STRING,
level STRING
)
LOCATION 's3://bucket/logs/';
-- Every query scans ALL data!
-- ✅ GOOD: Partitioned by date
CREATE EXTERNAL TABLE logs (
timestamp STRING,
message STRING,
level STRING
)
PARTITIONED BY (year INT, month INT, day INT)
LOCATION 's3://bucket/logs/';
-- Query scans only relevant partitions
3. Ignoring Redshift Distribution Keys
-- ❌ BAD: No distribution strategy for large fact table
CREATE TABLE sales (
sale_id BIGINT,
product_id INT,
customer_id INT,
amount DECIMAL(10,2)
);
-- Massive data shuffling during joins!
-- ✅ GOOD: Distribute on the frequently joined column
CREATE TABLE sales (
sale_id BIGINT,
product_id INT,
customer_id INT,
amount DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (sale_id);
-- Co-locates related data, minimizes shuffling
4. Not Using Columnar Formats
| Format | Approx. Query Cost (10 GB dataset, $5/TB scanned) | Query Performance |
|---|---|---|
| JSON (uncompressed) | ~$0.05 (scans all 10 GB) | Slow (full scan) |
| CSV (gzip) | ~$0.01-0.02 (scans compressed bytes) | Medium (compressed scan) |
| Parquet (snappy) | <$0.005 (columnar, predicate pushdown) | Fast |
5. Over-Sharding Kinesis Streams
💡 Remember: Each shard costs ~$0.015/hour ($10.80/month)
❌ Wrong: Provisioning 100 shards for 10 MB/sec throughput
✅ Right: 10 shards handle 10 MB/sec of ingest (1 MB/sec per shard)
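Shard counts fall straight out of the per-shard write limits (1 MB/sec or 1,000 records/sec in, 2 MB/sec out). A small helper to sanity-check sizing before provisioning:
import math

def required_shards(ingest_mb_per_sec: float, records_per_sec: float) -> int:
    """Return the minimum shard count for the given ingest rate.

    Each shard accepts up to 1 MB/sec or 1,000 records/sec of writes.
    """
    by_throughput = math.ceil(ingest_mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000.0)
    return max(by_throughput, by_records, 1)

# 10 MB/sec of small records -> 10 shards, not 100
print(required_shards(ingest_mb_per_sec=10, records_per_sec=5000))  # 10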
6. Not Monitoring Glue Job Metrics
## ✅ GOOD: Enable job metrics and bookmarks
import boto3

glue = boto3.client('glue')
glue.create_job(
Name='sales-etl',
Role='AWSGlueServiceRole',
Command={
'Name': 'glueetl',
'ScriptLocation': 's3://bucket/scripts/etl.py'
},
DefaultArguments={
'--enable-metrics': 'true',
'--enable-continuous-cloudwatch-log': 'true',
'--enable-job-insights': 'true',
'--job-bookmark-option': 'job-bookmark-enable',
'--TempDir': 's3://bucket/temp/',
'--enable-glue-datacatalog': 'true'
},
MaxRetries=1,
Timeout=120, # minutes
MaxCapacity=10 # DPUs
)
7. Missing Data Validation
## ✅ GOOD: Add data quality checks in Glue
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F
def validate_data(dynamic_frame):
df = dynamic_frame.toDF()
# Check for nulls in critical columns
null_counts = df.select(
[F.sum(F.col(c).isNull().cast('int')).alias(c) for c in df.columns]
).collect()[0].asDict()
for col, null_count in null_counts.items():
if col in ['customer_id', 'amount', 'transaction_date'] and null_count > 0:
raise ValueError(f"Critical column {col} has {null_count} null values")
# Check for duplicates
total_count = df.count()
distinct_count = df.select('transaction_id').distinct().count()
if total_count != distinct_count:
raise ValueError(f"Found {total_count - distinct_count} duplicate transactions")
return dynamic_frame
## Use in ETL job
cleaned_data = validate_data(raw_data)
Key Takeaways
Quick Reference Card
| Service | Use When | Key Feature |
|---|---|---|
| Redshift | Complex SQL analytics on structured data | Columnar storage, petabyte-scale |
| Athena | Ad-hoc SQL queries on S3 data | Serverless, pay-per-query |
| Glue | ETL transformations, data cataloging | Serverless Spark, auto-scaling |
| Kinesis Streams | Real-time data processing with replay | Custom consumers, retention up to 365 days |
| Kinesis Firehose | Simple streaming to S3/Redshift/ES | Auto-scaling, zero administration |
| Lake Formation | Centralized data lake with governance | Fine-grained access control |
Decision Tree:
Need to analyze data?
|
+-- Real-time? ------------------> Kinesis Data Streams / Analytics
|
+-- Interactive SQL?
|     |
|     +-- On S3? ----------------> Athena
|     +-- Data warehouse? -------> Redshift
|
+-- Data preparation? -----------> Glue ETL
Cost Optimization Checklist:
- ✅ Use Parquet/ORC with compression
- ✅ Partition data by query patterns
- ✅ Limit SELECT columns in Athena
- ✅ Right-size Redshift clusters (pause when idle)
- ✅ Use Glue job bookmarks to avoid reprocessing
- ✅ Monitor Kinesis shard utilization
- ✅ Use S3 Intelligent-Tiering for data lakes
Security Best Practices:
- ✅ Enable encryption at rest (S3, Redshift, Kinesis)
- ✅ Use VPC endpoints for private connectivity
- ✅ Implement least-privilege IAM policies
- ✅ Use Lake Formation for centralized access control
- ✅ Enable CloudTrail for audit logging
- ✅ Rotate credentials regularly
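Most of these controls are one-time API calls. A hedged sketch enabling default encryption on a lake bucket and KMS encryption on a Kinesis stream; the bucket, stream, and key names are placeholder assumptions:
import boto3

s3 = boto3.client('s3')
kinesis = boto3.client('kinesis')

# Default server-side encryption for the data lake bucket
s3.put_bucket_encryption(
    Bucket='my-data-lake',
    ServerSideEncryptionConfiguration={
        'Rules': [
            {'ApplyServerSideEncryptionByDefault': {'SSEAlgorithm': 'aws:kms'}}
        ]
    }
)

# Server-side encryption for records flowing through the clickstream
kinesis.start_stream_encryption(
    StreamName='clickstream',
    EncryptionType='KMS',
    KeyId='alias/aws/kinesis'
)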
Further Study
- AWS Documentation - Data Analytics: https://docs.aws.amazon.com/analytics/
- AWS Big Data Blog: https://aws.amazon.com/blogs/big-data/
- Redshift Performance Tuning: https://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Memory Device: Remember the data journey as CGRAK:
- Collect (Kinesis)
- Glue (Transform)
- Redshift (Warehouse)
- Athena (Query)
- Knowledge (Insights)
Master these services and you'll be architecting enterprise-scale data solutions with confidence!