Storage & Data Services
Master S3, databases, caching, and data architecture patterns for performance and cost efficiency
AWS Storage and Data Services
Master AWS storage solutions with free flashcards and spaced repetition practice. This lesson covers Amazon S3, EBS volumes, database services like RDS and DynamoDB, data transfer tools, and storage optimization strategies: essential concepts for building scalable cloud architectures and passing AWS certification exams.
Welcome to AWS Storage & Data Services 💾
AWS provides a comprehensive suite of storage and database services designed to handle everything from object storage to fully managed relational databases. Understanding when to use each service is crucial for building efficient, cost-effective cloud solutions. Whether you're storing petabytes of data, running high-performance databases, or transferring massive datasets, AWS has a service optimized for your needs.
Core Concepts: Storage Services 🗄️
Amazon S3 (Simple Storage Service) 🪣
Amazon S3 is AWS's flagship object storage service, offering industry-leading scalability, durability, and performance. S3 stores data as objects (files) within buckets (containers).
Key Features:
- Durability: 99.999999999% (11 nines), designed to sustain the concurrent loss of data in two facilities
- Scalability: Store unlimited objects, each up to 5TB
- Storage Classes: Optimize costs based on access patterns
- Versioning: Keep multiple versions of objects
- Lifecycle Policies: Automatically transition objects between storage classes
| Storage Class | Use Case | Retrieval Time | Cost |
|---|---|---|---|
| S3 Standard | Frequently accessed data | Milliseconds | Highest |
| S3 Intelligent-Tiering | Unknown/changing access patterns | Milliseconds | Auto-optimized |
| S3 Standard-IA | Infrequent access | Milliseconds | Lower |
| S3 One Zone-IA | Recreatable, infrequent data | Milliseconds | Lower |
| S3 Glacier Instant Retrieval | Archive, quarterly access | Milliseconds | Very low |
| S3 Glacier Flexible Retrieval | Archive, annual access | Minutes-hours | Extremely low |
| S3 Glacier Deep Archive | Long-term archive (7-10 years) | 12-48 hours | Lowest |
💡 Pro Tip: Use S3 Intelligent-Tiering when you can't predict access patterns; it automatically moves objects between tiers based on usage.
S3 Security & Access Control:
- Bucket Policies: JSON-based resource policies
- IAM Policies: User/role-based permissions
- Access Control Lists (ACLs): Legacy granular permissions
- Encryption: Server-side (SSE-S3, SSE-KMS, SSE-C) or client-side
- Pre-signed URLs: Temporary access to private objects (see the sketch after the upload example below)
import boto3
from botocore.exceptions import ClientError
## Create S3 client
s3 = boto3.client('s3')
## Upload file to S3
try:
s3.upload_file(
'local_file.txt',
'my-bucket',
'remote_file.txt'
)
print("Upload successful")
except ClientError as e:
print(f"Error: {e}")
## Download file from S3
s3.download_file(
'my-bucket',
'remote_file.txt',
'downloaded_file.txt'
)
Amazon EBS (Elastic Block Store) 💿
Amazon EBS provides persistent block-level storage volumes for EC2 instances. Think of EBS as virtual hard drives that can be attached to your virtual machines.
EBS Volume Types:
| Type | Name | Use Case | IOPS | Throughput |
|---|---|---|---|---|
| gp3 | General Purpose SSD | Boot volumes, low-latency apps | 16,000 | 1,000 MB/s |
| gp2 | General Purpose SSD | Legacy general purpose | 16,000 | 250 MB/s |
| io2 | Provisioned IOPS SSD | Mission-critical, databases | 64,000 | 1,000 MB/s |
| io2 Block Express | Highest performance SSD | Largest databases | 256,000 | 4,000 MB/s |
| st1 | Throughput Optimized HDD | Big data, data warehouses | 500 | 500 MB/s |
| sc1 | Cold HDD | Infrequently accessed data | 250 | 250 MB/s |
Key EBS Features:
- Snapshots: Point-in-time backups stored in S3
- Encryption: AES-256 encryption at rest
- Multi-Attach: Attach a single io1/io2 volume to multiple instances in the same AZ
- Elastic Volumes: Resize or change volume type without downtime (see the sketch after the CLI examples below)
## AWS CLI: Create EBS volume
aws ec2 create-volume \
--volume-type gp3 \
--size 100 \
--availability-zone us-east-1a \
--iops 3000 \
--throughput 125
## Attach volume to instance
aws ec2 attach-volume \
--volume-id vol-1234567890abcdef0 \
--instance-id i-1234567890abcdef0 \
--device /dev/sdf
## Create snapshot
aws ec2 create-snapshot \
--volume-id vol-1234567890abcdef0 \
--description "Backup before upgrade"
⚠️ Common Mistake: EBS volumes are Availability Zone-specific. You cannot attach an EBS volume in us-east-1a to an instance in us-east-1b. Use snapshots to move data between AZs.
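The Elastic Volumes feature from the list above can be exercised without detaching anything. Here is a minimal boto3 sketch; the volume ID is the same placeholder used in the CLI examples, and the new size and throughput values are illustrative:
import boto3

ec2 = boto3.client('ec2')

## Grow the volume to 200 GiB and raise gp3 throughput while it stays attached
ec2.modify_volume(
    VolumeId='vol-1234567890abcdef0',
    Size=200,
    VolumeType='gp3',
    Throughput=250
)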
Amazon EFS (Elastic File System) 📁
Amazon EFS provides a scalable, fully managed NFS (Network File System) that can be mounted by multiple EC2 instances simultaneously.
EFS vs EBS:
EBS (Block Storage)                 EFS (File Storage)

  EC2-1 ---- EBS volume 1             EC2-1 ----+
  EC2-2 ---- EBS volume 2                       +---- EFS file system
                                      EC2-2 ----+
  One-to-one attachment               Many-to-one mount
  (volume lives in one AZ)            (shared across AZs)
EFS Storage Classes:
- Standard: Frequently accessed files
- Infrequent Access (IA): Cost-optimized for files not accessed daily
- Lifecycle Management: Automatically move files to IA after N days without access (see the sketch after the example below)
import boto3
## Create EFS file system
efs = boto3.client('efs')
response = efs.create_file_system(
PerformanceMode='generalPurpose',
ThroughputMode='elastic',
Encrypted=True,
Tags=[
{'Key': 'Name', 'Value': 'shared-storage'},
]
)
fs_id = response['FileSystemId']
print(f"Created file system: {fs_id}")
Core Concepts: Database Services 🗄️
Amazon RDS (Relational Database Service) 🛢️
Amazon RDS manages relational databases, handling backups, patching, scaling, and replication automatically.
Supported Engines:
- Amazon Aurora (MySQL & PostgreSQL compatible)
- MySQL
- PostgreSQL
- MariaDB
- Oracle Database
- Microsoft SQL Server
Key Features:
- Automated Backups: Point-in-time recovery up to 35 days
- Multi-AZ Deployments: Synchronous replication for high availability
- Read Replicas: Asynchronous replication for read scaling
- Automatic Failover: Multi-AZ fails over automatically, typically within one to two minutes
RDS MULTI-AZ ARCHITECTURE

  Availability Zone A                        Availability Zone B
  +--------------------------+               +--------------------------+
  | Primary RDS instance     |  synchronous  | Standby RDS instance     |
  | (MySQL/PostgreSQL)       | ------------> | (automatic failover)     |
  +--------------------------+  replication  +--------------------------+
Amazon Aurora 🌟
Aurora is AWS's cloud-native database, offering:
- Up to 5x the throughput of standard MySQL
- Up to 3x the throughput of standard PostgreSQL
- Up to 15 read replicas with <10ms replica lag
- Automatic storage scaling up to 128 TB
- Global Database: Cross-region replication with typically under 1 second of lag (an Aurora cluster sketch follows the RDS example below)
import boto3
rds = boto3.client('rds')
## Create RDS MySQL instance
response = rds.create_db_instance(
DBInstanceIdentifier='mydb-instance',
DBInstanceClass='db.t3.micro',
Engine='mysql',
MasterUsername='admin',
MasterUserPassword='SecurePassword123!',
AllocatedStorage=20,
StorageType='gp3',
MultiAZ=True,
BackupRetentionPeriod=7,
PubliclyAccessible=False,
VpcSecurityGroupIds=['sg-12345678'],
Tags=[
{'Key': 'Environment', 'Value': 'production'},
]
)
print(f"Creating database: {response['DBInstance']['DBInstanceIdentifier']}")
Amazon DynamoDB ⚡
DynamoDB is a fully managed NoSQL database offering single-digit millisecond performance at any scale.
Key Concepts:
- Tables: Collection of items (rows)
- Items: Collection of attributes (columns)
- Primary Key: Partition key (required) + Sort key (optional)
- Secondary Indexes: Query on non-key attributes
DynamoDB Capacity Modes:
| Mode | Use Case | Pricing |
|---|---|---|
| On-Demand | Unpredictable workloads, new apps | Per-request pricing |
| Provisioned | Predictable traffic | Pay for provisioned capacity |
DynamoDB Features:
- Global Tables: Multi-region, multi-active replication
- DynamoDB Streams: Capture item-level changes
- Time to Live (TTL): Automatically delete expired items (see the sketch after the table example below)
- Point-in-Time Recovery: Restore to any second in last 35 days
- DAX (DynamoDB Accelerator): In-memory caching, microsecond latency
import boto3
from boto3.dynamodb.conditions import Key
dynamodb = boto3.resource('dynamodb')
## Create table
table = dynamodb.create_table(
TableName='Users',
KeySchema=[
{'AttributeName': 'user_id', 'KeyType': 'HASH'}, # Partition key
{'AttributeName': 'timestamp', 'KeyType': 'RANGE'} # Sort key
],
AttributeDefinitions=[
{'AttributeName': 'user_id', 'AttributeType': 'S'},
{'AttributeName': 'timestamp', 'AttributeType': 'N'}
],
BillingMode='PAY_PER_REQUEST'
)
## Put item
table.put_item(
Item={
'user_id': 'user123',
'timestamp': 1234567890,
'name': 'John Doe',
'email': 'john@example.com'
}
)
## Query items
response = table.query(
KeyConditionExpression=Key('user_id').eq('user123')
)
for item in response['Items']:
print(item)
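The TTL feature from the list above is enabled per table on an attribute of your choosing. A minimal sketch, assuming a hypothetical expires_at attribute that stores a Unix epoch timestamp:
import boto3

ddb_client = boto3.client('dynamodb')

## Items whose 'expires_at' timestamp is in the past are deleted automatically
## (typically within about 48 hours of expiring)
ddb_client.update_time_to_live(
    TableName='Users',
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': 'expires_at'  # hypothetical attribute holding an epoch timestamp
    }
)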
Amazon Redshift 📊
Redshift is AWS's fully managed data warehouse service optimized for online analytical processing (OLAP) and business intelligence workloads.
Key Features:
- Columnar Storage: Optimized for analytical queries
- Massively Parallel Processing (MPP): Distributes queries across nodes
- Redshift Spectrum: Query data directly in S3
- Concurrency Scaling: Automatically adds capacity for concurrent queries
🧠 Memory Device: REDshift = REDuced query time for Reporting, Extracting, Data warehouse tasks
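Redshift itself is queried with standard SQL; one convenient way to do that from Python without managing drivers is the Redshift Data API. A hedged sketch, with the cluster identifier, database, and user as placeholder assumptions:
import boto3

redshift_data = boto3.client('redshift-data')

## Submit an analytical query asynchronously; results are fetched later with get_statement_result
response = redshift_data.execute_statement(
    ClusterIdentifier='my-analytics-cluster',  # hypothetical cluster name
    Database='analytics',
    DbUser='admin',
    Sql='SELECT signup_date, COUNT(*) AS signups FROM users GROUP BY signup_date ORDER BY signup_date;'
)
print(f"Statement ID: {response['Id']}")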
Amazon ElastiCache 🚀
ElastiCache provides fully managed in-memory caching with Redis or Memcached.
| Feature | Redis | Memcached |
|---|---|---|
| Data Structures | Strings, Lists, Sets, Sorted Sets, Hashes | Strings only |
| Persistence | Yes (snapshots, AOF) | No |
| Replication | Yes (multi-AZ) | No |
| Multi-threading | No | Yes |
| Pub/Sub | Yes | No |
import boto3
elasticache = boto3.client('elasticache')
## Create Redis cluster
response = elasticache.create_cache_cluster(
CacheClusterId='my-redis-cluster',
CacheNodeType='cache.t3.micro',
Engine='redis',
NumCacheNodes=1,
EngineVersion='7.0',
Port=6379,
CacheSubnetGroupName='my-subnet-group',
SecurityGroupIds=['sg-12345678']
)
Core Concepts: Data Transfer & Migration 🚚
AWS Snow Family ❄️
Physical data transfer devices for moving massive amounts of data into and out of AWS.
| Device | Storage | Use Case | Data Transfer |
|---|---|---|---|
| Snowcone | 8-14 TB | Edge computing, small transfers | Online/Offline |
| Snowball Edge Storage Optimized | 80 TB | Large migrations, edge storage | Offline |
| Snowball Edge Compute Optimized | 42 TB | Edge computing, ML | Offline |
| Snowmobile | 100 PB | Exabyte-scale transfers | Offline (truck) |
SNOW FAMILY SIZE COMPARISON

  Snowcone         8-14 TB       (portable)
  Snowball Edge    ~80 TB        (ruggedized case)
  Snowmobile       up to 100 PB  (shipping container on a semi-truck)
💡 Pro Tip: Use the Snow Family when transferring more than 10 TB or when network bandwidth is limited. Rule of thumb: if it takes longer to transfer over the network than to ship a device, use Snow.
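To make that rule of thumb concrete, here is a back-of-the-envelope estimate; the link speed and utilization figures are illustrative assumptions:
## How long would 50 TB take over a 1 Gbps link running at ~80% utilization?
data_tb = 50
link_gbps = 1
utilization = 0.8

gigabits = data_tb * 8 * 1000                  # 1 TB is roughly 8,000 gigabits
seconds = gigabits / (link_gbps * utilization)
print(f"~{seconds / 86400:.1f} days")          # about 5.8 days, before retries and competing traffic
At that size a Snowball round trip of a week or two is already competitive, and it wins decisively for slower links or larger datasets.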
AWS DataSync 🔄
DataSync automates and accelerates data transfer between on-premises storage and AWS.
Key Features:
- Up to 10x faster than open-source tools
- Automated scheduling: Set up recurring transfers
- Data validation: Verifies data integrity
- Bandwidth throttling: Control network impact
Supported Destinations:
- Amazon S3
- Amazon EFS
- Amazon FSx for Windows File Server
- Amazon FSx for Lustre
import boto3
datasync = boto3.client('datasync')
## Create DataSync task
response = datasync.create_task(
SourceLocationArn='arn:aws:datasync:us-east-1:123456789012:location/loc-abcdef',
DestinationLocationArn='arn:aws:datasync:us-east-1:123456789012:location/loc-123456',
CloudWatchLogGroupArn='arn:aws:logs:us-east-1:123456789012:log-group:/aws/datasync',
Name='OnPremToS3Transfer',
Options={
'VerifyMode': 'POINT_IN_TIME_CONSISTENT',
'OverwriteMode': 'ALWAYS',
'TransferMode': 'CHANGED'
}
)
AWS Storage Gateway 🌉
Storage Gateway provides hybrid cloud storage integration, connecting on-premises environments with AWS storage.
Gateway Types:
| Type | Protocol | Use Case | Caching |
|---|---|---|---|
| File Gateway | NFS, SMB | File shares backed by S3 | Yes |
| Volume Gateway (Stored) | iSCSI | Primary data on-premises, async backup to S3 | Full local |
| Volume Gateway (Cached) | iSCSI | Primary data in S3, frequently accessed cached locally | Partial local |
| Tape Gateway | iSCSI VTL | Replace physical tape backups | N/A |
STORAGE GATEWAY ARCHITECTURE

  On-premises data center                          AWS Cloud
  +--------------------------------+               +----------------------+
  | App servers --> Storage        |     HTTPS     | Amazon S3            |
  |                 Gateway  ------+-------------> | (data storage)       |
  +--------------------------------+               +----------------------+
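Once a File Gateway appliance has been deployed and activated, its shares are managed through the storagegateway API. A minimal sketch of creating an NFS share backed by S3; every ARN, bucket name, and CIDR below is a hypothetical placeholder:
import uuid
import boto3

sgw = boto3.client('storagegateway')

## Expose an S3 bucket as an NFS share to on-premises clients
response = sgw.create_nfs_file_share(
    ClientToken=str(uuid.uuid4()),  # idempotency token
    GatewayARN='arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-ABCDEF01',
    Role='arn:aws:iam::123456789012:role/StorageGatewayS3Access',
    LocationARN='arn:aws:s3:::my-file-gateway-bucket',
    DefaultStorageClass='S3_STANDARD',
    ClientList=['10.0.0.0/16']  # on-premises CIDR allowed to mount the share
)
print(response['FileShareARN'])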
Example 1: Building a Three-Tier Web Application Storage Architecture 🏗️
Scenario: You're building a photo-sharing application that needs to store user uploads, serve static content, cache frequently accessed data, and maintain user metadata.
Architecture Design:
Users (web/mobile)
      |
      v
CloudFront (CDN) + S3 static hosting          <-- static assets
      |
      v
Application Load Balancer
      |
      v
EC2 Auto Scaling group (multiple EC2 instances)
      |
      +--> ElastiCache (Redis)                <-- session storage
      +--> S3 (Standard)                      <-- user uploads
      +--> RDS (Multi-AZ)                     <-- user metadata
Storage Component Breakdown:
- S3 Standard for user photo uploads:
import boto3
import uuid
s3 = boto3.client('s3')
def upload_photo(file_data, user_id):
photo_id = str(uuid.uuid4())
key = f"photos/{user_id}/{photo_id}.jpg"
s3.put_object(
Bucket='photo-app-uploads',
Key=key,
Body=file_data,
ContentType='image/jpeg',
ServerSideEncryption='AES256',
StorageClass='STANDARD'
)
return f"https://photo-app-uploads.s3.amazonaws.com/{key}"
- S3 Lifecycle Policy to optimize costs:
lifecycle_config = {
'Rules': [
{
'Id': 'MoveOldPhotosToIA',
'Status': 'Enabled',
'Transitions': [
{
'Days': 90,
'StorageClass': 'STANDARD_IA'
},
{
'Days': 180,
'StorageClass': 'GLACIER_IR'
}
],
'Filter': {'Prefix': 'photos/'}
}
]
}
s3.put_bucket_lifecycle_configuration(
Bucket='photo-app-uploads',
LifecycleConfiguration=lifecycle_config
)
- ElastiCache Redis for session management and frequently accessed metadata:
import redis
import json
## Connect to ElastiCache Redis
redis_client = redis.Redis(
host='my-cluster.cache.amazonaws.com',
port=6379,
decode_responses=True
)
def cache_user_profile(user_id, profile_data):
key = f"user:{user_id}:profile"
redis_client.setex(
key,
3600, # TTL: 1 hour
json.dumps(profile_data)
)
def get_cached_profile(user_id):
key = f"user:{user_id}:profile"
cached = redis_client.get(key)
return json.loads(cached) if cached else None
- RDS MySQL for structured user data with read replicas:
import pymysql
## Primary database connection
primary_db = pymysql.connect(
host='mydb.cluster-abc123.us-east-1.rds.amazonaws.com',
user='admin',
password='SecurePass123!',
database='photoapp'
)
## Read replica connection for queries
replica_db = pymysql.connect(
host='mydb.cluster-ro-abc123.us-east-1.rds.amazonaws.com',
user='admin',
password='SecurePass123!',
database='photoapp'
)
def get_user_photos(user_id):
cursor = replica_db.cursor()
cursor.execute(
"SELECT photo_id, s3_key, created_at FROM photos WHERE user_id = %s ORDER BY created_at DESC",
(user_id,)
)
return cursor.fetchall()
Cost Optimization Strategy:
- Photos older than 90 days → S3 Standard-IA (roughly 45% lower storage cost than Standard)
- Photos older than 180 days → Glacier Instant Retrieval (roughly 68% cheaper than Standard-IA)
- Cache frequently accessed data in Redis (reduce database load)
- Use read replicas for read-heavy operations
Example 2: Data Migration from On-Premises to AWS 📦
Scenario: Migrate 50TB of on-premises data to AWS, with ongoing synchronization during the migration period.
Migration Strategy:
MIGRATION PHASES

Phase 1: Bulk transfer (Snowball, offline, roughly 1-2 weeks)
  On-premises 50 TB storage --> Snowball device --(ship)--> AWS imports the data into S3

Phase 2: Ongoing sync (DataSync, online)
  On-premises new data --> DataSync agent --(HTTPS)--> S3
Step-by-Step Implementation:
Step 1: Order and configure Snowball
## AWS CLI: Create Snowball job
aws snowball create-job \
--job-type IMPORT \
--resources 'S3Resources=[{BucketArn=arn:aws:s3:::my-migration-bucket}]' \
--address-id ADID1234-5678-90ab-cdef-1234567890ab \
--shipping-option SECOND_DAY \
--snowball-capacity-preference T80
Step 2: Copy data to Snowball (on-premises)
## On-premises: Using Snowball client
snowball cp --recursive /mnt/data/ s3://my-migration-bucket/
Step 3: Ship Snowball back to AWS
- AWS receives device
- Data automatically imported to S3
- Snowball securely erased
Step 4: Set up DataSync for ongoing sync
import boto3
datasync = boto3.client('datasync')
## Create source location (on-premises)
source_response = datasync.create_location_smb(
ServerHostname='10.0.1.50',
Subdirectory='/data',
User='datasync-user',
Password='SecurePassword123!',
AgentArns=['arn:aws:datasync:us-east-1:123456789012:agent/agent-abc123']
)
## Create destination location (S3)
dest_response = datasync.create_location_s3(
S3BucketArn='arn:aws:s3:::my-migration-bucket',
S3Config={
'BucketAccessRoleArn': 'arn:aws:iam::123456789012:role/DataSyncS3Role'
}
)
## Create sync task
task_response = datasync.create_task(
SourceLocationArn=source_response['LocationArn'],
DestinationLocationArn=dest_response['LocationArn'],
Options={
'VerifyMode': 'POINT_IN_TIME_CONSISTENT',
'TransferMode': 'CHANGED',
'PreserveDeletedFiles': 'PRESERVE'
},
Schedule={
'ScheduleExpression': 'rate(1 hour)'
}
)
print(f"Migration task created: {task_response['TaskArn']}")
Step 5: Monitor and validate
## Check task execution status
executions = datasync.list_task_executions(
TaskArn=task_response['TaskArn']
)
for execution in executions['TaskExecutions']:
details = datasync.describe_task_execution(
TaskExecutionArn=execution['TaskExecutionArn']
)
print(f"Status: {details['Status']}")
print(f"Files transferred: {details['FilesTransferred']}")
print(f"Bytes transferred: {details['BytesTransferred']}")
Example 3: Serverless Data Processing Pipeline ⚙️
Scenario: Process uploaded CSV files, transform data, and load into a data warehouse for analytics.
Architecture:
SERVERLESS DATA PIPELINE

User upload
     |
     v
S3 bucket (raw/) --S3 event trigger--> Lambda (transform)
                                            |
                                            v
                                     S3 bucket (processed/)
                                            |
                                            v
                                     Lambda (load)
                                            |
                                            v
                                     Redshift data warehouse
Implementation:
Lambda Function 1: Transform CSV
import boto3
import csv
import json
from io import StringIO
s3 = boto3.client('s3')
def lambda_handler(event, context):
# Get uploaded file details
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
# Read CSV from S3
response = s3.get_object(Bucket=bucket, Key=key)
csv_content = response['Body'].read().decode('utf-8')
# Transform data
reader = csv.DictReader(StringIO(csv_content))
transformed_data = []
for row in reader:
transformed_row = {
'user_id': row['id'],
'email': row['email'].lower(),
'signup_date': row['created'],
'is_active': row['status'] == 'active'
}
transformed_data.append(transformed_row)
# Write newline-delimited JSON (one object per line) so the Redshift COPY step can load each record as a row
output_key = key.replace('raw/', 'processed/').replace('.csv', '.json')
s3.put_object(
Bucket='my-data-bucket',
Key=output_key,
Body='\n'.join(json.dumps(record) for record in transformed_data),
ContentType='application/json'
)
return {
'statusCode': 200,
'body': f'Processed {len(transformed_data)} records'
}
Lambda Function 2: Load to Redshift
import boto3
import psycopg2
import json
s3 = boto3.client('s3')
def lambda_handler(event, context):
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
# Connect to Redshift
conn = psycopg2.connect(
host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
port=5439,
dbname='analytics',
user='admin',
password='SecurePass123!'
)
cursor = conn.cursor()
# Use COPY command for efficient bulk load
copy_sql = f"""
COPY users_staging
FROM 's3://{bucket}/{key}'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
JSON 'auto'
TIMEFORMAT 'auto';
"""
cursor.execute(copy_sql)
conn.commit()
cursor.close()
conn.close()
return {
'statusCode': 200,
'body': 'Data loaded to Redshift'
}
S3 Event Configuration:
import boto3
s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')
## Grant S3 permission to invoke Lambda
lambda_client.add_permission(
FunctionName='transform-csv-function',
StatementId='s3-invoke-permission',
Action='lambda:InvokeFunction',
Principal='s3.amazonaws.com',
SourceArn='arn:aws:s3:::my-data-bucket'
)
## Configure S3 event notification
s3.put_bucket_notification_configuration(
Bucket='my-data-bucket',
NotificationConfiguration={
'LambdaFunctionConfigurations': [
{
'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:transform-csv-function',
'Events': ['s3:ObjectCreated:*'],
'Filter': {
'Key': {
'FilterRules': [
{'Name': 'prefix', 'Value': 'raw/'},
{'Name': 'suffix', 'Value': '.csv'}
]
}
}
}
]
}
)
Example 4: Disaster Recovery with Cross-Region Replication 🌍
Scenario: Implement a disaster recovery strategy with RPO (Recovery Point Objective) of 15 minutes and RTO (Recovery Time Objective) of 1 hour.
Architecture:
CROSS-REGION DISASTER RECOVERY

PRIMARY REGION (us-east-1)
  RDS primary (Multi-AZ) --> RDS read replica
  Automated RDS snapshots --> copied to the backup region
  S3 bucket (primary) --CRR--> S3 bucket (replica) in us-west-2
  DynamoDB table <--Global Tables--> DynamoDB replica in us-west-2

BACKUP REGION (us-west-2)
  S3 replica bucket, copied RDS snapshots (restored on failover), and the DynamoDB replica stand ready
S3 Cross-Region Replication:
import boto3
s3 = boto3.client('s3')
## Enable versioning (required for CRR)
s3.put_bucket_versioning(
Bucket='my-primary-bucket',
VersioningConfiguration={'Status': 'Enabled'}
)
s3.put_bucket_versioning(
Bucket='my-backup-bucket',
VersioningConfiguration={'Status': 'Enabled'}
)
## Configure cross-region replication
replication_config = {
'Role': 'arn:aws:iam::123456789012:role/S3ReplicationRole',
'Rules': [
{
'ID': 'ReplicateEverything',
'Status': 'Enabled',
'Priority': 1,
'Filter': {},
'Destination': {
'Bucket': 'arn:aws:s3:::my-backup-bucket',
'ReplicationTime': {
'Status': 'Enabled',
'Time': {'Minutes': 15}
},
'Metrics': {
'Status': 'Enabled',
'EventThreshold': {'Minutes': 15}
},
'StorageClass': 'STANDARD_IA'
},
'DeleteMarkerReplication': {'Status': 'Enabled'}
}
]
}
s3.put_bucket_replication(
Bucket='my-primary-bucket',
ReplicationConfiguration=replication_config
)
DynamoDB Global Tables:
import boto3
dynamodb = boto3.client('dynamodb')
## Create global table (both regional tables must already exist with DynamoDB Streams enabled)
response = dynamodb.create_global_table(
GlobalTableName='users-global',
ReplicationGroup=[
{'RegionName': 'us-east-1'},
{'RegionName': 'us-west-2'}
]
)
print(f"Global table created: {response['GlobalTableDescription']['GlobalTableName']}")
Automated RDS Snapshot Copy:
import boto3
rds = boto3.client('rds', region_name='us-east-1')
rds_backup = boto3.client('rds', region_name='us-west-2')
def copy_rds_snapshot(snapshot_id):
# Copy snapshot to backup region
response = rds_backup.copy_db_snapshot(
SourceDBSnapshotIdentifier=f'arn:aws:rds:us-east-1:123456789012:snapshot:{snapshot_id}',
TargetDBSnapshotIdentifier=f'{snapshot_id}-backup',
SourceRegion='us-east-1',  # lets boto3 build the pre-signed URL needed for a cross-region encrypted copy
KmsKeyId='arn:aws:kms:us-west-2:123456789012:key/abcd1234',
CopyTags=True
)
return response['DBSnapshot']['DBSnapshotIdentifier']
## Lambda function triggered by RDS snapshot creation event
def lambda_handler(event, context):
snapshot_id = event['detail']['SourceIdentifier']
backup_snapshot = copy_rds_snapshot(snapshot_id)
print(f"Copied snapshot to backup region: {backup_snapshot}")
Common Mistakes ⚠️
1. Not Understanding Storage Class Transition Rules
❌ Wrong: Transitioning objects straight to Glacier Deep Archive after 30 days without considering minimum durations
## S3 won't transition objects to Standard-IA/One Zone-IA before they are 30 days old,
## and Glacier classes carry minimum storage duration charges (90 days; 180 days for Deep Archive)
lifecycle_config = {
'Rules': [{
'Transitions': [{
'Days': 30,
'StorageClass': 'DEEP_ARCHIVE'  # ❌ Billed for the full 180-day minimum even if deleted sooner
}]
}]
}
✅ Right: Space out transitions to respect minimum object ages and storage durations
lifecycle_config = {
'Rules': [{
'Transitions': [
{'Days': 30, 'StorageClass': 'STANDARD_IA'},
{'Days': 90, 'StorageClass': 'GLACIER_IR'},
{'Days': 180, 'StorageClass': 'DEEP_ARCHIVE'}
]
}]
}
2. Forgetting EBS Volumes are AZ-Locked
❌ Wrong: Trying to attach an EBS volume from a different AZ
## Instance in us-east-1a, volume in us-east-1b
aws ec2 attach-volume \
--volume-id vol-abc123 \
--instance-id i-def456 \
--device /dev/sdf  # ❌ Will fail: the volume is in a different AZ
✅ Right: Create a snapshot, then create a volume in the target AZ
## Create snapshot
aws ec2 create-snapshot --volume-id vol-abc123
## Create volume in correct AZ from snapshot
aws ec2 create-volume \
--snapshot-id snap-xyz789 \
--availability-zone us-east-1a
3. Not Enabling Point-in-Time Recovery for DynamoDB
❌ Wrong: Relying only on on-demand backups
## Manual backups only
dynamodb.create_backup(
TableName='users',
BackupName='manual-backup'
)
✅ Right: Enable continuous backups with PITR
## Enable point-in-time recovery
dynamodb.update_continuous_backups(
TableName='users',
PointInTimeRecoverySpecification={
'PointInTimeRecoveryEnabled': True
}
)
## Now you can restore to any second in last 35 days
4. Using Wrong Read Consistency for DynamoDB
❌ Wrong: Using eventually consistent reads when strong consistency is required
## Eventually consistent read (may return stale data)
response = table.get_item(
Key={'user_id': 'user123'}
# ConsistentRead defaults to False
)
✅ Right: Use strongly consistent reads for critical data
## Strongly consistent read (always returns latest data)
response = table.get_item(
Key={'user_id': 'user123'},
ConsistentRead=True  # ✅ Guarantees the latest data
)
5. Not Encrypting Sensitive Data at Rest
❌ Wrong: Storing sensitive data without encryption
s3.put_object(
Bucket='user-data',
Key='ssn-records.csv',
Body=sensitive_data
# ❌ No encryption specified!
)
✅ Right: Always encrypt sensitive data
s3.put_object(
Bucket='user-data',
Key='ssn-records.csv',
Body=sensitive_data,
ServerSideEncryption='aws:kms',  # ✅ KMS encryption
SSEKMSKeyId='arn:aws:kms:us-east-1:123456789012:key/abc123'
)
6. Choosing Wrong RDS Instance Size
🧠 Memory Device: RAMI - Read patterns, Availability needs, Memory requirements, IOPS demands
❌ Wrong: Undersizing the database instance
- High CPU utilization (>80% sustained)
- Memory swapping
- Connection pool exhaustion
✅ Right: Monitor CloudWatch metrics and scale appropriately
import boto3
from datetime import datetime, timedelta
cloudwatch = boto3.client('cloudwatch')
## Check CPU utilization
response = cloudwatch.get_metric_statistics(
Namespace='AWS/RDS',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': 'mydb'}],
StartTime=datetime.utcnow() - timedelta(hours=1),
EndTime=datetime.utcnow(),
Period=300,
Statistics=['Average']
)
avg_cpu = sum(point['Average'] for point in response['Datapoints']) / len(response['Datapoints'])
if avg_cpu > 80:
print("⚠️ Consider scaling up RDS instance!")
7. Not Using S3 Transfer Acceleration for Large Files
❌ Wrong: Uploading large files directly to S3 from distant regions
## Slow upload from Asia to us-east-1
s3.upload_file('large_video.mp4', 'my-bucket', 'video.mp4')
✅ Right: Enable Transfer Acceleration for faster uploads
## Enable Transfer Acceleration on bucket
s3.put_bucket_accelerate_configuration(
Bucket='my-bucket',
AccelerateConfiguration={'Status': 'Enabled'}
)
## Use accelerated endpoint
s3_accelerated = boto3.client(
's3',
config=boto3.session.Config(
s3={'use_accelerate_endpoint': True}
)
)
s3_accelerated.upload_file('large_video.mp4', 'my-bucket', 'video.mp4')
## ✅ Up to 50-500% faster from distant locations
Key Takeaways 🎯
📊 AWS Storage & Data Services Quick Reference
| Service | Type | Best For | Key Feature |
|---|---|---|---|
| S3 | Object Storage | Static content, backups, data lakes | 11 nines durability |
| EBS | Block Storage | EC2 boot/data volumes | Snapshots, encryption |
| EFS | File Storage | Shared file systems | Multi-instance access |
| RDS | Relational DB | Structured data, transactions | Automated backups, Multi-AZ |
| Aurora | Cloud-Native DB | High-performance SQL | 5x MySQL performance |
| DynamoDB | NoSQL DB | Key-value, millisecond latency | Global Tables |
| Redshift | Data Warehouse | Analytics, BI | Columnar storage, MPP |
| ElastiCache | In-Memory Cache | Session storage, caching | Microsecond latency |
| Snow Family | Physical Transfer | Large-scale migrations | Offline data transfer |
| DataSync | Online Transfer | Automated sync | 10x faster than open-source |
📝 Study Tips:
- S3 lifecycle transitions: Standard → Standard-IA (30d) → Glacier (90d) → Deep Archive (180d)
- EBS vs EFS: EBS = one instance, EFS = many instances
- RDS Multi-AZ: Synchronous replication for high availability
- RDS Read Replicas: Asynchronous replication for read scaling
- DynamoDB consistency: Eventually consistent (default) vs Strongly consistent
- Choose Snowball when: Data > 10TB or network transfer time > shipping time
🔧 Try This: Hands-On Practice
Challenge 1: Create an S3 bucket with lifecycle policies
BUCKET=my-practice-bucket-$(date +%s)
aws s3 mb s3://$BUCKET
aws s3api put-bucket-lifecycle-configuration \
--bucket $BUCKET \
--lifecycle-configuration file://lifecycle.json
Challenge 2: Launch an RDS instance with read replica
## Create primary
aws rds create-db-instance \
--db-instance-identifier practice-db-primary \
--db-instance-class db.t3.micro \
--engine mysql \
--allocated-storage 20 \
--master-username admin \
--master-user-password TempPass123!
## Create read replica (after primary is available)
aws rds create-db-instance-read-replica \
--db-instance-identifier practice-db-replica \
--source-db-instance-identifier practice-db-primary
Challenge 3: Set up DynamoDB table with auto-scaling
import boto3
dynamodb = boto3.client('dynamodb')
application_autoscaling = boto3.client('application-autoscaling')
## Create table
table = dynamodb.create_table(
TableName='practice-table',
KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
BillingMode='PROVISIONED',
ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5}
)
## Configure auto-scaling
application_autoscaling.register_scalable_target(
ServiceNamespace='dynamodb',
ResourceId='table/practice-table',
ScalableDimension='dynamodb:table:WriteCapacityUnits',
MinCapacity=5,
MaxCapacity=100
)
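Registering the scalable target only defines the capacity range; to actually scale, attach a target-tracking policy as well. Continuing the snippet above, with an illustrative 70% utilization target:
## Keep write-capacity utilization around 70% by scaling between the registered min and max
application_autoscaling.put_scaling_policy(
    PolicyName='practice-table-write-scaling',
    ServiceNamespace='dynamodb',
    ResourceId='table/practice-table',
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'DynamoDBWriteCapacityUtilization'
        }
    }
)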
📚 Further Study
- AWS Storage Services Overview - Official AWS storage service documentation
- Amazon S3 Best Practices - Performance and security optimization guide
- AWS Database Migration Service Guide - Comprehensive migration strategies and tools
🎉 Congratulations! You now understand AWS storage and data services. Practice with free flashcards above, and experiment with the AWS Free Tier to solidify your knowledge. Remember: choosing the right storage service depends on your access patterns, durability requirements, and cost constraints!