Storage & Data Services
Master S3, databases, caching, and data architecture patterns for performance and cost efficiency
AWS Storage and Data Services
Master AWS storage solutions with free flashcards and spaced repetition practice. This lesson covers Amazon S3, EBS volumes, database services like RDS and DynamoDB, data transfer tools, and storage optimization strategies: essential concepts for building scalable cloud architectures and passing AWS certification exams.
Welcome to AWS Storage & Data Services 💾
AWS provides a comprehensive suite of storage and database services designed to handle everything from object storage to fully managed relational databases. Understanding when to use each service is crucial for building efficient, cost-effective cloud solutions. Whether you're storing petabytes of data, running high-performance databases, or transferring massive datasets, AWS has a service optimized for your needs.
Core Concepts: Storage Services 🗄️
Amazon S3 (Simple Storage Service) 🪣
Amazon S3 is AWS's flagship object storage service, offering industry-leading scalability, durability, and performance. S3 stores data as objects (files) within buckets (containers).
Key Features:
- Durability: 99.999999999% (11 nines), designed to sustain the concurrent loss of data in two facilities
- Scalability: Store unlimited objects, each up to 5TB
- Storage Classes: Optimize costs based on access patterns
- Versioning: Keep multiple versions of objects
- Lifecycle Policies: Automatically transition objects between storage classes
| Storage Class | Use Case | Retrieval Time | Cost |
|---|---|---|---|
| S3 Standard | Frequently accessed data | Milliseconds | Highest |
| S3 Intelligent-Tiering | Unknown/changing access patterns | Milliseconds | Auto-optimized |
| S3 Standard-IA | Infrequent access | Milliseconds | Lower |
| S3 One Zone-IA | Recreatable, infrequent data | Milliseconds | Lower |
| S3 Glacier Instant Retrieval | Archive, quarterly access | Milliseconds | Very low |
| S3 Glacier Flexible Retrieval | Archive, annual access | Minutes-hours | Extremely low |
| S3 Glacier Deep Archive | Long-term archive (7-10 years) | 12-48 hours | Lowest |
💡 Pro Tip: Use S3 Intelligent-Tiering when you can't predict access patterns; it automatically moves objects between tiers based on usage.
S3 Security & Access Control:
- Bucket Policies: JSON-based resource policies
- IAM Policies: User/role-based permissions
- Access Control Lists (ACLs): Legacy granular permissions
- Encryption: Server-side (SSE-S3, SSE-KMS, SSE-C) or client-side
- Pre-signed URLs: Temporary access to private objects (see the sketch after the upload example below)
import boto3
from botocore.exceptions import ClientError
## Create S3 client
s3 = boto3.client('s3')
## Upload file to S3
try:
s3.upload_file(
'local_file.txt',
'my-bucket',
'remote_file.txt'
)
print("Upload successful")
except ClientError as e:
print(f"Error: {e}")
## Download file from S3
s3.download_file(
'my-bucket',
'remote_file.txt',
'downloaded_file.txt'
)
Amazon EBS (Elastic Block Store) 💿
Amazon EBS provides persistent block-level storage volumes for EC2 instances. Think of EBS as virtual hard drives that can be attached to your virtual machines.
EBS Volume Types:
| Type | Name | Use Case | IOPS | Throughput |
|---|---|---|---|---|
| gp3 | General Purpose SSD | Boot volumes, low-latency apps | 16,000 | 1,000 MB/s |
| gp2 | General Purpose SSD | Legacy general purpose | 16,000 | 250 MB/s |
| io2 | Provisioned IOPS SSD | Mission-critical, databases | 64,000 | 1,000 MB/s |
| io2 Block Express | Highest performance SSD | Largest databases | 256,000 | 4,000 MB/s |
| st1 | Throughput Optimized HDD | Big data, data warehouses | 500 | 500 MB/s |
| sc1 | Cold HDD | Infrequently accessed data | 250 | 250 MB/s |
Key EBS Features:
- Snapshots: Point-in-time backups stored in S3
- Encryption: AES-256 encryption at rest
- Multi-Attach: Attach a single io1/io2 volume to multiple instances in the same AZ
- Elastic Volumes: Resize or change volume type without downtime (see the sketch after the CLI examples below)
## AWS CLI: Create EBS volume
aws ec2 create-volume \
--volume-type gp3 \
--size 100 \
--availability-zone us-east-1a \
--iops 3000 \
--throughput 125
## Attach volume to instance
aws ec2 attach-volume \
--volume-id vol-1234567890abcdef0 \
--instance-id i-1234567890abcdef0 \
--device /dev/sdf
## Create snapshot
aws ec2 create-snapshot \
--volume-id vol-1234567890abcdef0 \
--description "Backup before upgrade"
⚠️ Common Mistake: EBS volumes are Availability Zone-specific. You cannot attach an EBS volume in us-east-1a to an instance in us-east-1b. Use snapshots to move data between AZs.
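The Elastic Volumes feature from the list above can be exercised without detaching anything. Here is a minimal boto3 sketch; the volume ID is the same placeholder used in the CLI examples, and the new size and throughput values are illustrative:
import boto3

ec2 = boto3.client('ec2')

## Grow the volume to 200 GiB and raise gp3 throughput while it stays attached
ec2.modify_volume(
    VolumeId='vol-1234567890abcdef0',
    Size=200,
    VolumeType='gp3',
    Throughput=250
)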
Amazon EFS (Elastic File System) 📁
Amazon EFS provides a scalable, fully managed NFS (Network File System) that can be mounted by multiple EC2 instances simultaneously.
EFS vs EBS:
EBS (Block Storage)                 EFS (File Storage)

  EC2-1 ---- EBS volume 1             EC2-1 ----+
  EC2-2 ---- EBS volume 2                       +---- EFS file system
                                      EC2-2 ----+
  One-to-one attachment               Many-to-one mount
  (volume lives in one AZ)            (shared across AZs)
EFS Storage Classes:
- Standard: Frequently accessed files
- Infrequent Access (IA): Cost-optimized for files not accessed daily
- Lifecycle Management: Automatically move files to IA after N days without access (see the sketch after the example below)
import boto3
## Create EFS file system
efs = boto3.client('efs')
response = efs.create_file_system(
PerformanceMode='generalPurpose',
ThroughputMode='elastic',
Encrypted=True,
Tags=[
{'Key': 'Name', 'Value': 'shared-storage'},
]
)
fs_id = response['FileSystemId']
print(f"Created file system: {fs_id}")
Core Concepts: Database Services 🗄️
Amazon RDS (Relational Database Service) 🛢️
Amazon RDS manages relational databases, handling backups, patching, scaling, and replication automatically.
Supported Engines:
- Amazon Aurora (MySQL & PostgreSQL compatible)
- MySQL
- PostgreSQL
- MariaDB
- Oracle Database
- Microsoft SQL Server
Key Features:
- Automated Backups: Point-in-time recovery up to 35 days
- Multi-AZ Deployments: Synchronous replication for high availability
- Read Replicas: Asynchronous replication for read scaling
- Automatic Failover: Multi-AZ fails over automatically, typically within one to two minutes
RDS MULTI-AZ ARCHITECTURE

  Availability Zone A                        Availability Zone B
  +--------------------------+               +--------------------------+
  | Primary RDS instance     |  synchronous  | Standby RDS instance     |
  | (MySQL/PostgreSQL)       | ------------> | (automatic failover)     |
  +--------------------------+  replication  +--------------------------+
Amazon Aurora 🌟
Aurora is AWS's cloud-native database, offering:
- Up to 5x the throughput of standard MySQL
- Up to 3x the throughput of standard PostgreSQL
- Up to 15 read replicas with <10ms replica lag
- Automatic storage scaling up to 128 TB
- Global Database: Cross-region replication with typically under 1 second of lag (an Aurora cluster sketch follows the RDS example below)
import boto3
rds = boto3.client('rds')
## Create RDS MySQL instance
response = rds.create_db_instance(
DBInstanceIdentifier='mydb-instance',
DBInstanceClass='db.t3.micro',
Engine='mysql',
MasterUsername='admin',
MasterUserPassword='SecurePassword123!',
AllocatedStorage=20,
StorageType='gp3',
MultiAZ=True,
BackupRetentionPeriod=7,
PubliclyAccessible=False,
VpcSecurityGroupIds=['sg-12345678'],
Tags=[
{'Key': 'Environment', 'Value': 'production'},
]
)
print(f"Creating database: {response['DBInstance']['DBInstanceIdentifier']}")
Amazon DynamoDB ⚡
DynamoDB is a fully managed NoSQL database offering single-digit millisecond performance at any scale.
Key Concepts:
- Tables: Collection of items (rows)
- Items: Collection of attributes (columns)
- Primary Key: Partition key (required) + Sort key (optional)
- Secondary Indexes: Query on non-key attributes
DynamoDB Capacity Modes:
| Mode | Use Case | Pricing |
|---|---|---|
| On-Demand | Unpredictable workloads, new apps | Per-request pricing |
| Provisioned | Predictable traffic | Pay for provisioned capacity |
DynamoDB Features:
- Global Tables: Multi-region, multi-active replication
- DynamoDB Streams: Capture item-level changes
- Time to Live (TTL): Automatically delete expired items (see the sketch after the table example below)
- Point-in-Time Recovery: Restore to any second in last 35 days
- DAX (DynamoDB Accelerator): In-memory caching, microsecond latency
import boto3
from boto3.dynamodb.conditions import Key
dynamodb = boto3.resource('dynamodb')
## Create table
table = dynamodb.create_table(
TableName='Users',
KeySchema=[
{'AttributeName': 'user_id', 'KeyType': 'HASH'}, # Partition key
{'AttributeName': 'timestamp', 'KeyType': 'RANGE'} # Sort key
],
AttributeDefinitions=[
{'AttributeName': 'user_id', 'AttributeType': 'S'},
{'AttributeName': 'timestamp', 'AttributeType': 'N'}
],
BillingMode='PAY_PER_REQUEST'
)
## Put item
table.put_item(
Item={
'user_id': 'user123',
'timestamp': 1234567890,
'name': 'John Doe',
'email': 'john@example.com'
}
)
## Query items
response = table.query(
KeyConditionExpression=Key('user_id').eq('user123')
)
for item in response['Items']:
print(item)
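The TTL feature from the list above is enabled per table on an attribute of your choosing. A minimal sketch, assuming a hypothetical expires_at attribute that stores a Unix epoch timestamp:
import boto3

ddb_client = boto3.client('dynamodb')

## Items whose 'expires_at' timestamp is in the past are deleted automatically
## (typically within about 48 hours of expiring)
ddb_client.update_time_to_live(
    TableName='Users',
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': 'expires_at'  # hypothetical attribute holding an epoch timestamp
    }
)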
Amazon Redshift 📊
Redshift is AWS's fully managed data warehouse service optimized for online analytical processing (OLAP) and business intelligence workloads.
Key Features:
- Columnar Storage: Optimized for analytical queries
- Massively Parallel Processing (MPP): Distributes queries across nodes
- Redshift Spectrum: Query data directly in S3
- Concurrency Scaling: Automatically adds capacity for concurrent queries
🧠 Memory Device: REDshift = REDuced query time for Reporting, Extracting, Data warehouse tasks
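Redshift itself is queried with standard SQL; one convenient way to do that from Python without managing drivers is the Redshift Data API. A hedged sketch, with the cluster identifier, database, and user as placeholder assumptions:
import boto3

redshift_data = boto3.client('redshift-data')

## Submit an analytical query asynchronously; results are fetched later with get_statement_result
response = redshift_data.execute_statement(
    ClusterIdentifier='my-analytics-cluster',  # hypothetical cluster name
    Database='analytics',
    DbUser='admin',
    Sql='SELECT signup_date, COUNT(*) AS signups FROM users GROUP BY signup_date ORDER BY signup_date;'
)
print(f"Statement ID: {response['Id']}")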
Amazon ElastiCache 🚀
ElastiCache provides fully managed in-memory caching with Redis or Memcached.
| Feature | Redis | Memcached |
|---|---|---|
| Data Structures | Strings, Lists, Sets, Sorted Sets, Hashes | Strings only |
| Persistence | Yes (snapshots, AOF) | No |
| Replication | Yes (multi-AZ) | No |
| Multi-threading | No | Yes |
| Pub/Sub | Yes | No |
import boto3
elasticache = boto3.client('elasticache')
## Create Redis cluster
response = elasticache.create_cache_cluster(
CacheClusterId='my-redis-cluster',
CacheNodeType='cache.t3.micro',
Engine='redis',
NumCacheNodes=1,
EngineVersion='7.0',
Port=6379,
CacheSubnetGroupName='my-subnet-group',
SecurityGroupIds=['sg-12345678']
)
Core Concepts: Data Transfer & Migration 🚚
AWS Snow Family ❄️
Physical data transfer devices for moving massive amounts of data into and out of AWS.
| Device | Storage | Use Case | Data Transfer |
|---|---|---|---|
| Snowcone | 8-14 TB | Edge computing, small transfers | Online/Offline |
| Snowball Edge Storage Optimized | 80 TB | Large migrations, edge storage | Offline |
| Snowball Edge Compute Optimized | 42 TB | Edge computing, ML | Offline |
| Snowmobile | 100 PB | Exabyte-scale transfers | Offline (truck) |
SNOW FAMILY SIZE COMPARISON

  Snowcone         8-14 TB       (portable)
  Snowball Edge    ~80 TB        (ruggedized case)
  Snowmobile       up to 100 PB  (shipping container on a semi-truck)
💡 Pro Tip: Use the Snow Family when transferring more than 10 TB or when network bandwidth is limited. Rule of thumb: if it takes longer to transfer over the network than to ship a device, use Snow.
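To make that rule of thumb concrete, here is a back-of-the-envelope estimate; the link speed and utilization figures are illustrative assumptions:
## How long would 50 TB take over a 1 Gbps link running at ~80% utilization?
data_tb = 50
link_gbps = 1
utilization = 0.8

gigabits = data_tb * 8 * 1000                  # 1 TB is roughly 8,000 gigabits
seconds = gigabits / (link_gbps * utilization)
print(f"~{seconds / 86400:.1f} days")          # about 5.8 days, before retries and competing traffic
At that size a Snowball round trip of a week or two is already competitive, and it wins decisively for slower links or larger datasets.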
AWS DataSync 🔄
DataSync automates and accelerates data transfer between on-premises storage and AWS.
Key Features:
- Up to 10x faster than open-source tools
- Automated scheduling: Set up recurring transfers
- Data validation: Verifies data integrity
- Bandwidth throttling: Control network impact
Supported Destinations:
- Amazon S3
- Amazon EFS
- Amazon FSx for Windows File Server
- Amazon FSx for Lustre
import boto3
datasync = boto3.client('datasync')
## Create DataSync task
response = datasync.create_task(
SourceLocationArn='arn:aws:datasync:us-east-1:123456789012:location/loc-abcdef',
DestinationLocationArn='arn:aws:datasync:us-east-1:123456789012:location/loc-123456',
CloudWatchLogGroupArn='arn:aws:logs:us-east-1:123456789012:log-group:/aws/datasync',
Name='OnPremToS3Transfer',
Options={
'VerifyMode': 'POINT_IN_TIME_CONSISTENT',
'OverwriteMode': 'ALWAYS',
'TransferMode': 'CHANGED'
}
)
AWS Storage Gateway 🌉
Storage Gateway provides hybrid cloud storage integration, connecting on-premises environments with AWS storage.
Gateway Types:
| Type | Protocol | Use Case | Caching |
|---|---|---|---|
| File Gateway | NFS, SMB | File shares backed by S3 | Yes |
| Volume Gateway (Stored) | iSCSI | Primary data on-premises, async backup to S3 | Full local |
| Volume Gateway (Cached) | iSCSI | Primary data in S3, frequently accessed cached locally | Partial local |
| Tape Gateway | iSCSI VTL | Replace physical tape backups | N/A |
STORAGE GATEWAY ARCHITECTURE

  On-premises data center                          AWS Cloud
  +--------------------------------+               +----------------------+
  | App servers --> Storage        |     HTTPS     | Amazon S3            |
  |                 Gateway  ------+-------------> | (data storage)       |
  +--------------------------------+               +----------------------+
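Once a File Gateway appliance has been deployed and activated, its shares are managed through the storagegateway API. A minimal sketch of creating an NFS share backed by S3; every ARN, bucket name, and CIDR below is a hypothetical placeholder:
import uuid
import boto3

sgw = boto3.client('storagegateway')

## Expose an S3 bucket as an NFS share to on-premises clients
response = sgw.create_nfs_file_share(
    ClientToken=str(uuid.uuid4()),  # idempotency token
    GatewayARN='arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-ABCDEF01',
    Role='arn:aws:iam::123456789012:role/StorageGatewayS3Access',
    LocationARN='arn:aws:s3:::my-file-gateway-bucket',
    DefaultStorageClass='S3_STANDARD',
    ClientList=['10.0.0.0/16']  # on-premises CIDR allowed to mount the share
)
print(response['FileShareARN'])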
Example 1: Building a Three-Tier Web Application Storage Architecture 🏗️
Scenario: You're building a photo-sharing application that needs to store user uploads, serve static content, cache frequently accessed data, and maintain user metadata.
Architecture Design:
Users (web/mobile)
      |
      v
CloudFront (CDN) + S3 static hosting          <-- static assets
      |
      v
Application Load Balancer
      |
      v
EC2 Auto Scaling group (multiple EC2 instances)
      |
      +--> ElastiCache (Redis)                <-- session storage
      +--> S3 (Standard)                      <-- user uploads
      +--> RDS (Multi-AZ)                     <-- user metadata
Storage Component Breakdown:
- S3 Standard for user photo uploads:
import boto3
import uuid
s3 = boto3.client('s3')
def upload_photo(file_data, user_id):
photo_id = str(uuid.uuid4())
key = f"photos/{user_id}/{photo_id}.jpg"
s3.put_object(
Bucket='photo-app-uploads',
Key=key,
Body=file_data,
ContentType='image/jpeg',
ServerSideEncryption='AES256',
StorageClass='STANDARD'
)
return f"https://photo-app-uploads.s3.amazonaws.com/{key}"
- S3 Lifecycle Policy to optimize costs:
lifecycle_config = {
'Rules': [
{
'Id': 'MoveOldPhotosToIA',
'Status': 'Enabled',
'Transitions': [
{
'Days': 90,
'StorageClass': 'STANDARD_IA'
},
{
'Days': 180,
'StorageClass': 'GLACIER_IR'
}
],
'Filter': {'Prefix': 'photos/'}
}
]
}
s3.put_bucket_lifecycle_configuration(
Bucket='photo-app-uploads',
LifecycleConfiguration=lifecycle_config
)
- ElastiCache Redis for session management and frequently accessed metadata:
import redis
import json
## Connect to ElastiCache Redis
redis_client = redis.Redis(
host='my-cluster.cache.amazonaws.com',
port=6379,
decode_responses=True
)
def cache_user_profile(user_id, profile_data):
key = f"user:{user_id}:profile"
redis_client.setex(
key,
3600, # TTL: 1 hour
json.dumps(profile_data)
)
def get_cached_profile(user_id):
key = f"user:{user_id}:profile"
cached = redis_client.get(key)
return json.loads(cached) if cached else None
- RDS MySQL for structured user data with read replicas:
import pymysql
## Primary database connection
primary_db = pymysql.connect(
host='mydb.cluster-abc123.us-east-1.rds.amazonaws.com',
user='admin',
password='SecurePass123!',
database='photoapp'
)
## Read replica connection for queries
replica_db = pymysql.connect(
host='mydb.cluster-ro-abc123.us-east-1.rds.amazonaws.com',
user='admin',
password='SecurePass123!',
database='photoapp'
)
def get_user_photos(user_id):
cursor = replica_db.cursor()
cursor.execute(
"SELECT photo_id, s3_key, created_at FROM photos WHERE user_id = %s ORDER BY created_at DESC",
(user_id,)
)
return cursor.fetchall()
Cost Optimization Strategy:
- Photos older than 90 days → S3 Standard-IA (roughly 45% lower storage cost than Standard)
- Photos older than 180 days → Glacier Instant Retrieval (roughly 68% cheaper than Standard-IA)
- Cache frequently accessed data in Redis (reduce database load)
- Use read replicas for read-heavy operations
Example 2: Data Migration from On-Premises to AWS 📦
Scenario: Migrate 50TB of on-premises data to AWS, with ongoing synchronization during the migration period.
Migration Strategy:
MIGRATION PHASES

Phase 1: Bulk transfer (Snowball, offline, roughly 1-2 weeks)
  On-premises 50 TB storage --> Snowball device --(ship)--> AWS imports the data into S3

Phase 2: Ongoing sync (DataSync, online)
  On-premises new data --> DataSync agent --(HTTPS)--> S3
Step-by-Step Implementation:
Step 1: Order and configure Snowball
## AWS CLI: Create Snowball job
aws snowball create-job \
--job-type IMPORT \
--resources 'S3Resources=[{BucketArn=arn:aws:s3:::my-migration-bucket}]' \
--address-id ADID1234-5678-90ab-cdef-1234567890ab \
--shipping-option SECOND_DAY \
--snowball-capacity-preference T80
Step 2: Copy data to Snowball (on-premises)
## On-premises: Using Snowball client
snowball cp --recursive /mnt/data/ s3://my-migration-bucket/
Step 3: Ship Snowball back to AWS
- AWS receives device
- Data automatically imported to S3
- Snowball securely erased
Step 4: Set up DataSync for ongoing sync
import boto3
datasync = boto3.client('datasync')
## Create source location (on-premises)
source_response = datasync.create_location_smb(
ServerHostname='10.0.1.50',
Subdirectory='/data',
User='datasync-user',
Password='SecurePassword123!',
AgentArns=['arn:aws:datasync:us-east-1:123456789012:agent/agent-abc123']
)
## Create destination location (S3)
dest_response = datasync.create_location_s3(
S3BucketArn='arn:aws:s3:::my-migration-bucket',
S3Config={
'BucketAccessRoleArn': 'arn:aws:iam::123456789012:role/DataSyncS3Role'
}
)
## Create sync task
task_response = datasync.create_task(
SourceLocationArn=source_response['LocationArn'],
DestinationLocationArn=dest_response['LocationArn'],
Options={
'VerifyMode': 'POINT_IN_TIME_CONSISTENT',
'TransferMode': 'CHANGED',
'PreserveDeletedFiles': 'PRESERVE'
},
Schedule={
'ScheduleExpression': 'rate(1 hour)'
}
)
print(f"Migration task created: {task_response['TaskArn']}")
Step 5: Monitor and validate
## Check task execution status
executions = datasync.list_task_executions(
TaskArn=task_response['TaskArn']
)
for execution in executions['TaskExecutions']:
details = datasync.describe_task_execution(
TaskExecutionArn=execution['TaskExecutionArn']
)
print(f"Status: {details['Status']}")
print(f"Files transferred: {details['FilesTransferred']}")
print(f"Bytes transferred: {details['BytesTransferred']}")
Example 3: Serverless Data Processing Pipeline ⚙️
Scenario: Process uploaded CSV files, transform data, and load into a data warehouse for analytics.
Architecture:
SERVERLESS DATA PIPELINE

User upload
     |
     v
S3 bucket (raw/) --S3 event trigger--> Lambda (transform)
                                            |
                                            v
                                     S3 bucket (processed/)
                                            |
                                            v
                                     Lambda (load)
                                            |
                                            v
                                     Redshift data warehouse
Implementation:
Lambda Function 1: Transform CSV
import boto3
import csv
import json
from io import StringIO
s3 = boto3.client('s3')
def lambda_handler(event, context):
# Get uploaded file details
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
# Read CSV from S3
response = s3.get_object(Bucket=bucket, Key=key)
csv_content = response['Body'].read().decode('utf-8')
# Transform data
reader = csv.DictReader(StringIO(csv_content))
transformed_data = []
for row in reader:
transformed_row = {
'user_id': row['id'],
'email': row['email'].lower(),
'signup_date': row['created'],
'is_active': row['status'] == 'active'
}
transformed_data.append(transformed_row)
# Write newline-delimited JSON (one object per line) so the Redshift COPY step can load each record as a row
output_key = key.replace('raw/', 'processed/').replace('.csv', '.json')
s3.put_object(
Bucket='my-data-bucket',
Key=output_key,
Body='\n'.join(json.dumps(record) for record in transformed_data),
ContentType='application/json'
)
return {
'statusCode': 200,
'body': f'Processed {len(transformed_data)} records'
}
Lambda Function 2: Load to Redshift
import boto3
import psycopg2
import json
s3 = boto3.client('s3')
def lambda_handler(event, context):
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
# Connect to Redshift
conn = psycopg2.connect(
host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
port=5439,
dbname='analytics',
user='admin',
password='SecurePass123!'
)
cursor = conn.cursor()
# Use COPY command for efficient bulk load
copy_sql = f"""
COPY users_staging
FROM 's3://{bucket}/{key}'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 'auto'
TIMEFORMAT 'auto';
"""
cursor.execute(copy_sql)
conn.commit()
cursor.close()
conn.close()
return {
'statusCode': 200,
'body': 'Data loaded to Redshift'
}
S3 Event Configuration:
import boto3
s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')
## Grant S3 permission to invoke Lambda
lambda_client.add_permission(
FunctionName='transform-csv-function',
StatementId='s3-invoke-permission',
Action='lambda:InvokeFunction',
Principal='s3.amazonaws.com',
SourceArn='arn:aws:s3:::my-data-bucket'
)
## Configure S3 event notification
s3.put_bucket_notification_configuration(
Bucket='my-data-bucket',
NotificationConfiguration={
'LambdaFunctionConfigurations': [
{
'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:transform-csv-function',
'Events': ['s3:ObjectCreated:*'],
'Filter': {
'Key': {
'FilterRules': [
{'Name': 'prefix', 'Value': 'raw/'},
{'Name': 'suffix', 'Value': '.csv'}
]
}
}
}
]
}
)
Example 4: Disaster Recovery with Cross-Region Replication 🌍
Scenario: Implement a disaster recovery strategy with RPO (Recovery Point Objective) of 15 minutes and RTO (Recovery Time Objective) of 1 hour.
Architecture:
CROSS-REGION DISASTER RECOVERY

PRIMARY REGION (us-east-1)
  RDS primary (Multi-AZ) --> RDS read replica
  Automated RDS snapshots --> copied to the backup region
  S3 bucket (primary) --CRR--> S3 bucket (replica) in us-west-2
  DynamoDB table <--Global Tables--> DynamoDB replica in us-west-2

BACKUP REGION (us-west-2)
  S3 replica bucket, copied RDS snapshots (restored on failover), and the DynamoDB replica stand ready
S3 Cross-Region Replication:
import boto3
s3 = boto3.client('s3')
## Enable versioning (required for CRR)
s3.put_bucket_versioning(
Bucket='my-primary-bucket',
VersioningConfiguration={'Status': 'Enabled'}
)
s3.put_bucket_versioning(
Bucket='my-backup-bucket',
VersioningConfiguration={'Status': 'Enabled'}
)
## Configure cross-region replication
replication_config = {
'Role': 'arn:aws:iam::123456789012:role/S3ReplicationRole',
'Rules': [
{
'ID': 'ReplicateEverything',
'Status': 'Enabled',
'Priority': 1,
'Filter': {},
'Destination': {
'Bucket': 'arn:aws:s3:::my-backup-bucket',
'ReplicationTime': {
'Status': 'Enabled',
'Time': {'Minutes': 15}
},
'Metrics': {
'Status': 'Enabled',
'EventThreshold': {'Minutes': 15}
},
'StorageClass': 'STANDARD_IA'
},
'DeleteMarkerReplication': {'Status': 'Enabled'}
}
]
}
s3.put_bucket_replication(
Bucket='my-primary-bucket',
ReplicationConfiguration=replication_config
)
DynamoDB Global Tables:
import boto3
dynamodb = boto3.client('dynamodb')
## Create global table (both regional tables must already exist with DynamoDB Streams enabled)
response = dynamodb.create_global_table(
GlobalTableName='users-global',
ReplicationGroup=[
{'RegionName': 'us-east-1'},
{'RegionName': 'us-west-2'}
]
)
print(f"Global table created: {response['GlobalTableDescription']['GlobalTableName']}")
Automated RDS Snapshot Copy:
import boto3
rds = boto3.client('rds', region_name='us-east-1')
rds_backup = boto3.client('rds', region_name='us-west-2')
def copy_rds_snapshot(snapshot_id):
# Copy snapshot to backup region
response = rds_backup.copy_db_snapshot(
SourceDBSnapshotIdentifier=f'arn:aws:rds:us-east-1:123456789012:snapshot:{snapshot_id}',
TargetDBSnapshotIdentifier=f'{snapshot_id}-backup',
SourceRegion='us-east-1',  # lets boto3 build the pre-signed URL needed for a cross-region encrypted copy
KmsKeyId='arn:aws:kms:us-west-2:123456789012:key/abcd1234',
CopyTags=True
)
return response['DBSnapshot']['DBSnapshotIdentifier']
## Lambda function triggered by RDS snapshot creation event
def lambda_handler(event, context):
snapshot_id = event['detail']['SourceIdentifier']
backup_snapshot = copy_rds_snapshot(snapshot_id)
print(f"Copied snapshot to backup region: {backup_snapshot}")
Common Mistakes ⚠️
1. Not Understanding Storage Class Transition Rules
❌ Wrong: Transitioning objects straight to Glacier Deep Archive after 30 days without considering minimum durations
## S3 won't transition objects to Standard-IA/One Zone-IA before they are 30 days old,
## and Glacier classes carry minimum storage duration charges (90 days; 180 days for Deep Archive)
lifecycle_config = {
'Rules': [{
'Transitions': [{
'Days': 30,
'StorageClass': 'DEEP_ARCHIVE'  # ❌ Billed for the full 180-day minimum even if deleted sooner
}]
}]
}
✅ Right: Space out transitions to respect minimum object ages and storage durations
lifecycle_config = {
'Rules': [{
'Transitions': [
{'Days': 30, 'StorageClass': 'STANDARD_IA'},
{'Days': 90, 'StorageClass': 'GLACIER_IR'},
{'Days': 180, 'StorageClass': 'DEEP_ARCHIVE'}
]
}]
}
2. Forgetting EBS Volumes are AZ-Locked
❌ Wrong: Trying to attach an EBS volume from a different AZ
## Instance in us-east-1a, volume in us-east-1b
aws ec2 attach-volume \
--volume-id vol-abc123 \
--instance-id i-def456 \
--device /dev/sdf  # ❌ Will fail: the volume is in a different AZ
✅ Right: Create a snapshot, then create a volume in the target AZ
## Create snapshot
aws ec2 create-snapshot --volume-id vol-abc123
## Create volume in correct AZ from snapshot
aws ec2 create-volume \
--snapshot-id snap-xyz789 \
--availability-zone us-east-1a
3. Not Enabling Point-in-Time Recovery for DynamoDB
❌ Wrong: Relying only on on-demand backups
## Manual backups only
dynamodb.create_backup(
TableName='users',
BackupName='manual-backup'
)
✅ Right: Enable continuous backups with PITR
## Enable point-in-time recovery
dynamodb.update_continuous_backups(
TableName='users',
PointInTimeRecoverySpecification={
'PointInTimeRecoveryEnabled': True
}
)
## Now you can restore to any second in last 35 days
4. Using Wrong Read Consistency for DynamoDB
❌ Wrong: Using eventually consistent reads when strong consistency is required
## Eventually consistent read (may return stale data)
response = table.get_item(
Key={'user_id': 'user123'}
# ConsistentRead defaults to False
)
✅ Right: Use strongly consistent reads for critical data
## Strongly consistent read (always returns latest data)
response = table.get_item(
Key={'user_id': 'user123'},
ConsistentRead=True  # ✅ Guarantees the latest data
)
5. Not Encrypting Sensitive Data at Rest
❌ Wrong: Storing sensitive data without encryption
s3.put_object(
Bucket='user-data',
Key='ssn-records.csv',
Body=sensitive_data
# ❌ No encryption specified!
)
✅ Right: Always encrypt sensitive data
s3.put_object(
Bucket='user-data',
Key='ssn-records.csv',
Body=sensitive_data,
ServerSideEncryption='aws:kms',  # ✅ KMS encryption
SSEKMSKeyId='arn:aws:kms:us-east-1:123456789012:key/abc123'
)
6. Choosing Wrong RDS Instance Size
🧠 Memory Device: RAMI - Read patterns, Availability needs, Memory requirements, IOPS demands
❌ Wrong: Undersizing the database instance
- High CPU utilization (>80% sustained)
- Memory swapping
- Connection pool exhaustion
✅ Right: Monitor CloudWatch metrics and scale appropriately
import boto3
from datetime import datetime, timedelta
cloudwatch = boto3.client('cloudwatch')
## Check CPU utilization
response = cloudwatch.get_metric_statistics(
Namespace='AWS/RDS',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'mydb'}],
StartTime=datetime.utcnow() - timedelta(hours=1),
EndTime=datetime.utcnow(),
Period=300,
Statistics=['Average']
)
avg_cpu = sum(point['Average'] for point in response['Datapoints']) / len(response['Datapoints'])
if avg_cpu > 80:
print("⚠️ Consider scaling up RDS instance!")
7. Not Using S3 Transfer Acceleration for Large Files
❌ Wrong: Uploading large files directly to S3 from distant regions
## Slow upload from Asia to us-east-1
s3.upload_file('large_video.mp4', 'my-bucket', 'video.mp4')
✅ Right: Enable Transfer Acceleration for faster uploads
## Enable Transfer Acceleration on bucket
s3.put_bucket_accelerate_configuration(
Bucket='my-bucket',
AccelerateConfiguration={'Status': 'Enabled'}
)
## Use accelerated endpoint
s3_accelerated = boto3.client(
's3',
config=boto3.session.Config(
s3={'use_accelerate_endpoint': True}
)
)
s3_accelerated.upload_file('large_video.mp4', 'my-bucket', 'video.mp4')
## ✅ Up to 50-500% faster from distant locations
Key Takeaways 🎯
📊 AWS Storage & Data Services Quick Reference
| Service | Type | Best For | Key Feature |
|---|---|---|---|
| S3 | Object Storage | Static content, backups, data lakes | 11 nines durability |
| EBS | Block Storage | EC2 boot/data volumes | Snapshots, encryption |
| EFS | File Storage | Shared file systems | Multi-instance access |
| RDS | Relational DB | Structured data, transactions | Automated backups, Multi-AZ |
| Aurora | Cloud-Native DB | High-performance SQL | 5x MySQL performance |
| DynamoDB | NoSQL DB | Key-value, millisecond latency | Global Tables |
| Redshift | Data Warehouse | Analytics, BI | Columnar storage, MPP |
| ElastiCache | In-Memory Cache | Session storage, caching | Microsecond latency |
| Snow Family | Physical Transfer | Large-scale migrations | Offline data transfer |
| DataSync | Online Transfer | Automated sync | 10x faster than open-source |
📝 Study Tips:
- S3 lifecycle transitions: Standard → Standard-IA (30d) → Glacier (90d) → Deep Archive (180d)
- EBS vs EFS: EBS = one instance, EFS = many instances
- RDS Multi-AZ: Synchronous replication for high availability
- RDS Read Replicas: Asynchronous replication for read scaling
- DynamoDB consistency: Eventually consistent (default) vs Strongly consistent
- Choose Snowball when: Data > 10TB or network transfer time > shipping time
🔧 Try This: Hands-On Practice
Challenge 1: Create an S3 bucket with lifecycle policies
BUCKET=my-practice-bucket-$(date +%s)
aws s3 mb s3://$BUCKET
aws s3api put-bucket-lifecycle-configuration \
--bucket $BUCKET \
--lifecycle-configuration file://lifecycle.json
Challenge 2: Launch an RDS instance with read replica
## Create primary
aws rds create-db-instance \
--db-instance-identifier practice-db-primary \
--db-instance-class db.t3.micro \
--engine mysql \
--allocated-storage 20 \
--master-username admin \
--master-user-password TempPass123!
## Create read replica (after primary is available)
aws rds create-db-instance-read-replica \
--db-instance-identifier practice-db-replica \
--source-db-instance-identifier practice-db-primary
Challenge 3: Set up DynamoDB table with auto-scaling
import boto3
dynamodb = boto3.client('dynamodb')
application_autoscaling = boto3.client('application-autoscaling')
## Create table
table = dynamodb.create_table(
TableName='practice-table',
KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
BillingMode='PROVISIONED',
ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
)
## Configure auto-scaling
application_autoscaling.register_scalable_target(
ServiceNamespace='dynamodb',
ResourceId='table/practice-table',
ScalableDimension='dynamodb:table:WriteCapacityUnits',
MinCapacity=5,
MaxCapacity=100
)
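Registering the scalable target only defines the capacity range; to actually scale, attach a target-tracking policy as well. Continuing the snippet above, with an illustrative 70% utilization target:
## Keep write-capacity utilization around 70% by scaling between the registered min and max
application_autoscaling.put_scaling_policy(
    PolicyName='practice-table-write-scaling',
    ServiceNamespace='dynamodb',
    ResourceId='table/practice-table',
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'DynamoDBWriteCapacityUtilization'
        }
    }
)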
📚 Further Study
- AWS Storage Services Overview - Official AWS storage service documentation
- Amazon S3 Best Practices - Performance and security optimization guide
- AWS Database Migration Service Guide - Comprehensive migration strategies and tools
🎉 Congratulations! You now understand AWS storage and data services. Practice with free flashcards above, and experiment with the AWS Free Tier to solidify your knowledge. Remember: choosing the right storage service depends on your access patterns, durability requirements, and cost constraints!