Container Orchestration
ECS vs EKS vs Fargate comparison, when to use each, and container deployment strategies
Container Orchestration on AWS
Master container orchestration with free flashcards and spaced repetition practice. This lesson covers Amazon ECS, EKS, Fargate, service discovery, and scaling strategies: essential concepts for building production-ready containerized applications on AWS.
Welcome to Container Orchestration
Container orchestration transforms how we deploy and manage applications at scale. While Docker packages your application into containers, orchestration platforms like Amazon ECS (Elastic Container Service) and Amazon EKS (Elastic Kubernetes Service) handle the complex tasks of scheduling, scaling, health monitoring, and service discovery across fleets of containers.
Think of containers as shipping containers and orchestration as the port management system. Just as ports coordinate where containers go, how they're loaded onto ships, and when they arrive, container orchestration platforms coordinate where your application containers run, how they scale, and how they communicate.
Why Container Orchestration Matters:
- Automated deployment - Deploy hundreds of containers with a single command
- Self-healing - Automatically replace failed containers
- Dynamic scaling - Add/remove containers based on demand
- Service discovery - Containers find each other automatically
- Rolling updates - Update applications with zero downtime
- Resource optimization - Pack containers efficiently across hosts
Core Concepts
Container Orchestration Fundamentals
Container orchestration is the automated management of containerized application lifecycles. On AWS, you have three primary options:
| Service | Best For | Control Level | Learning Curve |
|---|---|---|---|
| Amazon ECS | AWS-native apps | High | Low |
| Amazon EKS | Kubernetes workloads | Very High | High |
| AWS Fargate | Serverless containers | Low | Very Low |
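The decision logic behind this table can be sketched as a tiny helper. This is illustrative only; the function name and rules are assumptions made for this lesson, not an official AWS tool:

```python
def recommend_orchestrator(needs_kubernetes: bool, manage_servers: bool) -> str:
    """Hypothetical decision helper mirroring the comparison table above.

    Pick EKS when you need Kubernetes compatibility, ECS on EC2 when you
    want control over hosts, and Fargate when you want serverless operation.
    """
    if needs_kubernetes:
        return "EKS"
    return "ECS on EC2" if manage_servers else "ECS on Fargate"

print(recommend_orchestrator(needs_kubernetes=False, manage_servers=False))
# ECS on Fargate
```

Note that Fargate is a launch type rather than a separate orchestrator: you still use it through ECS (or EKS), which is why the "serverless" answer above is phrased as "ECS on Fargate".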
Amazon ECS Architecture
Amazon ECS is AWS's proprietary container orchestration service. It uses familiar AWS concepts and integrates seamlessly with other AWS services.
ECS ARCHITECTURE

+------------------+          +------------------+
|     CLUSTER      |<-------->|     SERVICE      |
| (Logical Group)  |          | (Desired State)  |
+------------------+          +---------+--------+
         |                              |
         v                              v
+------------------+          +------------------+
|    CONTAINER     |<-------->|       TASK       |
|    INSTANCES     |          |   (Container     |
|  (EC2/Fargate)   |          |   Definition)    |
+------------------+          +------------------+
         |
         v
+------------------+
|    CONTAINERS    |
+------------------+
Key ECS Components:
- Cluster - Logical grouping of container instances (EC2 or Fargate)
- Task Definition - Blueprint describing your containers (like a Dockerfile for orchestration)
- Task - Running instance of a task definition
- Service - Maintains desired number of tasks, handles load balancing
- Container Instance - EC2 instance running the ECS agent (not needed with Fargate)
💡 Memory Aid - CTTSC: Cluster holds Task definitions that run Tasks via Services on Container instances.
Task Definitions
A task definition is a JSON blueprint that describes:
- Which container images to use
- CPU and memory requirements
- Networking mode
- IAM roles
- Environment variables
- Volume mounts
Here's a basic task definition structure:
{
"family": "web-app",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"containerDefinitions": [
{
"name": "web-container",
"image": "nginx:latest",
"portMappings": [
{
"containerPort": 80,
"protocol": "tcp"
}
],
"essential": true,
"environment": [
{
"name": "ENV",
"value": "production"
}
]
}
]
}
Important fields:
- family - Groups related task definition versions
- networkMode - How containers communicate (awsvpc gives each task its own ENI)
- requiresCompatibilities - EC2, Fargate, or both
- cpu / memory - Resource allocations (in Fargate units)
- essential - If true, the task stops when this container stops
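Fargate only accepts certain CPU/memory pairings: cpu 256 with memory 512 registers fine, but a mismatched pair is rejected. A quick validator sketch follows. The tier table reflects commonly documented sizes up to 4 vCPU and is truncated (larger sizes exist), so verify against the current Fargate documentation before relying on it:

```python
# Fargate CPU (units) -> allowed memory values (MiB); illustrative/truncated.
FARGATE_COMBOS = {
    256:  [512, 1024, 2048],
    512:  [1024 * i for i in range(1, 5)],    # 1-4 GB
    1024: [1024 * i for i in range(2, 9)],    # 2-8 GB
    2048: [1024 * i for i in range(4, 17)],   # 4-16 GB
    4096: [1024 * i for i in range(8, 31)],   # 8-30 GB
}

def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if this cpu/memory pair is in the table above."""
    return memory in FARGATE_COMBOS.get(cpu, [])

print(is_valid_fargate_size(256, 512))    # True
print(is_valid_fargate_size(256, 4096))   # False: too much memory for 256 CPU
```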
Amazon EKS (Elastic Kubernetes Service)
Amazon EKS runs upstream Kubernetes on AWS, giving you full Kubernetes compatibility. It manages the control plane (master nodes) while you manage worker nodes.
EKS ARCHITECTURE

+----------------------------------+
|   CONTROL PLANE (AWS Managed)    |
|   API Server | etcd              |
|   Scheduler  | Controllers       |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
|    WORKER NODES (Your VPC)       |
|   [ Pod ]   [ Pod ]   [ Pod ]    |
|   Node 1    Node 2    Node 3     |
+----------------------------------+
Kubernetes vs ECS Terminology:
| ECS Term | Kubernetes Term | Description |
|---|---|---|
| Task | Pod | Group of containers running together |
| Service | Deployment + Service | Manages replicas and exposes them |
| Task Definition | Pod Spec | Container configuration blueprint |
| Cluster | Cluster | Group of compute resources |
AWS Fargate: Serverless Containers
AWS Fargate is a serverless compute engine for containers. You don't manage EC2 instancesβjust define your containers and Fargate handles the infrastructure.
🎯 Key Benefits:
- No server management - AWS provisions and scales compute
- Pay per use - Charged only for vCPU and memory consumed
- Better security - Task-level isolation, each task has its own kernel
- Right-sizing - Precise resource allocation per task
Fargate Launch Type vs EC2 Launch Type:
| Aspect | Fargate | EC2 |
|---|---|---|
| Server Management | ✅ Fully managed | ❌ You manage EC2 instances |
| Scaling | ✅ Instant, per-task | ⚠️ Must scale instances first |
| Pricing | Per vCPU-second + memory | EC2 instance pricing |
| Use Case | Variable workloads, quick starts | Consistent workloads, cost optimization |
| Control | Limited (no host access) | Full (SSH, custom AMIs) |
Service Discovery
Service discovery enables containers to find and communicate with each other automatically. AWS provides two mechanisms:
1. AWS Cloud Map
Cloud Map creates a service registry integrated with Route 53:
import boto3
servicediscovery = boto3.client('servicediscovery')
## Create namespace
namespace = servicediscovery.create_private_dns_namespace(
Name='internal.example.com',
Vpc='vpc-12345678'
)
## Create service
service = servicediscovery.create_service(
Name='web-api',
DnsConfig={
'DnsRecords': [{'Type': 'A', 'TTL': 60}]
},
HealthCheckCustomConfig={'FailureThreshold': 1}
)
Now containers can call web-api.internal.example.com and Cloud Map routes to healthy instances.
2. Application Load Balancer (ALB) with Target Groups
ALB can route to ECS tasks dynamically:
{
"loadBalancers": [
{
"targetGroupArn": "arn:aws:elasticloadbalancing:...",
"containerName": "web-container",
"containerPort": 80
}
],
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-12345", "subnet-67890"],
"securityGroups": ["sg-12345"],
"assignPublicIp": "ENABLED"
}
}
}
ECS automatically registers/deregisters tasks from the target group.
Auto Scaling Strategies
Container orchestration enables sophisticated scaling patterns:
1. Service Auto Scaling (Task Level)
Target Tracking Scaling - Maintain a metric at target value:
{
"ServiceName": "web-service",
"ScalableDimension": "ecs:service:DesiredCount",
"PolicyName": "cpu-target-tracking",
"PolicyType": "TargetTrackingScaling",
"TargetTrackingScalingPolicyConfiguration": {
"TargetValue": 75.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}
}
This maintains average CPU at 75% by adding/removing tasks.
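Conceptually, target tracking sizes the service in proportion to how far the metric sits from the target. AWS implements this internally via CloudWatch alarms and does not publish the exact algorithm, so the sketch below borrows the proportional rule that Kubernetes documents for its Horizontal Pod Autoscaler; it gives the right intuition but is not the literal ECS implementation:

```python
import math

def desired_tasks(current: int, metric: float, target: float) -> int:
    """Proportional scaling rule (per the Kubernetes HPA docs):
    desired = ceil(current * metric / target).
    Used here only as an approximation of ECS target tracking."""
    return math.ceil(current * metric / target)

print(desired_tasks(4, 90.0, 75.0))  # CPU above target -> scale out to 5
print(desired_tasks(4, 40.0, 75.0))  # CPU below target -> scale in to 3
```

Cooldowns then gate how quickly consecutive adjustments may fire, which is why the scale-in cooldown above is deliberately longer than the scale-out cooldown.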
2. Cluster Auto Scaling (Infrastructure Level)
Capacity Providers manage cluster capacity automatically:
ecs.put_cluster_capacity_providers(
cluster='production',
capacityProviders=['FARGATE', 'my-capacity-provider'],
defaultCapacityProviderStrategy=[
{
'capacityProvider': 'FARGATE',
'weight': 1,
'base': 2
}
]
)
Capacity provider strategy:
- base - Minimum number of tasks placed on this provider
- weight - Relative distribution (e.g., a 3:1 ratio between providers)
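To see how base and weight interact, here is a simplified distribution calculation. Real ECS placement also applies rounding rules and placement constraints, so treat this as an approximation; the FARGATE_SPOT strategy below is a hypothetical example:

```python
def split_tasks(total: int, providers: list[dict]) -> dict:
    """Approximate how a capacity provider strategy distributes tasks:
    'base' tasks go to their provider first, the remainder splits by weight."""
    counts = {p['name']: 0 for p in providers}
    remaining = total
    # Satisfy each provider's base first
    for p in providers:
        b = min(p.get('base', 0), remaining)
        counts[p['name']] += b
        remaining -= b
    # Split the rest proportionally by weight (floor division; any
    # remainder from rounding is ignored in this sketch)
    total_weight = sum(p['weight'] for p in providers)
    for p in providers:
        counts[p['name']] += remaining * p['weight'] // total_weight
    return counts

strategy = [
    {'name': 'FARGATE', 'weight': 1, 'base': 2},
    {'name': 'FARGATE_SPOT', 'weight': 3, 'base': 0},
]
print(split_tasks(10, strategy))  # {'FARGATE': 4, 'FARGATE_SPOT': 6}
```

With base=2 and a 1:3 weight ratio, FARGATE always runs the first 2 tasks, then the remaining 8 split 2/6, which is a common pattern for keeping a guaranteed on-demand floor under a mostly-Spot service.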
3. Scheduled Scaling
Scale based on predictable patterns:
scaling = boto3.client('application-autoscaling')
## Scale up for business hours
scaling.put_scheduled_action(
ServiceNamespace='ecs',
ScheduledActionName='scale-up-morning',
ResourceId='service/production/web-app',
ScalableDimension='ecs:service:DesiredCount',
Schedule='cron(0 8 * * ? *)',
ScalableTargetAction={'MinCapacity': 10, 'MaxCapacity': 50}
)
## Scale down after hours
scaling.put_scheduled_action(
ServiceNamespace='ecs',
ScheduledActionName='scale-down-evening',
ResourceId='service/production/web-app',
ScalableDimension='ecs:service:DesiredCount',
Schedule='cron(0 20 * * ? *)',
ScalableTargetAction={'MinCapacity': 2, 'MaxCapacity': 10}
)
SCALING DECISION FLOW

Metric Threshold Reached
          |
          v
Cooldown Period Over?
     |          |
    YES         NO --> Wait
     |
     v
Scale Action Triggered
     |
     v
Provision/Terminate Tasks
     |
     v
Update Desired Count
     |
     v
Start Cooldown Timer
💡 Scaling Best Practice: Set the scale-out cooldown low (60s) but the scale-in cooldown high (300s+). This allows quick response to load increases but prevents flapping during decreases.
Task Networking Modes
ECS supports multiple networking modes:
| Mode | How It Works | Use Case | Fargate Support |
|---|---|---|---|
| awsvpc | Each task gets its own ENI with private IP | Microservices, security groups per task | ✅ Required |
| bridge | Docker bridge on host, port mapping | Simple apps, port conflicts OK | ❌ No |
| host | Direct host network, no isolation | Maximum performance, no port conflicts | ❌ No |
| none | No external networking | Batch jobs, local processing | ❌ No |
awsvpc mode is recommended for production:
{
"networkMode": "awsvpc",
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-abc123", "subnet-def456"],
"securityGroups": ["sg-web-tier"],
"assignPublicIp": "DISABLED"
}
}
}
Benefits:
- Task-level security groups
- VPC Flow Logs per task
- Direct integration with VPC routing
- Required for Fargate
⚠️ ENI Limit Warning: Each awsvpc task consumes one ENI. Check EC2 instance ENI limits when sizing.
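A quick way to reason about this warning: subtract the instance's primary ENI from its ENI limit to get the awsvpc task capacity. The numbers below are placeholders; actual ENI limits, and the much higher per-instance task counts available with ENI trunking, vary by instance type, so check the AWS documentation for yours:

```python
def max_awsvpc_tasks(eni_limit: int, trunking: bool = False,
                     trunk_capacity: int = 10) -> int:
    """Illustrative awsvpc capacity check for one EC2 container instance.

    One ENI is reserved for the instance itself; each awsvpc task consumes
    one more. With ENI trunking enabled, a trunk ENI multiplexes many task
    ENIs (trunk_capacity is a placeholder, not a real per-type limit).
    """
    if trunking:
        return trunk_capacity
    return max(eni_limit - 1, 0)

print(max_awsvpc_tasks(3))  # an instance allowing 3 ENIs fits only 2 tasks
```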
Detailed Examples
Example 1: Deploying a Microservice with ECS and Fargate
Let's deploy a Node.js API with automatic scaling:
Step 1: Create Task Definition
{
"family": "api-service",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789:role/apiTaskRole",
"containerDefinitions": [
{
"name": "api",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
"cpu": 512,
"memory": 1024,
"essential": true,
"portMappings": [
{
"containerPort": 3000,
"protocol": "tcp"
}
],
"environment": [
{"name": "NODE_ENV", "value": "production"},
{"name": "PORT", "value": "3000"}
],
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/api-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "api"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
Key elements explained:
- executionRoleArn - IAM role for the ECS agent (pull images, write logs)
- taskRoleArn - IAM role for your application code (access AWS services)
- secrets - Inject from Secrets Manager (never hardcode passwords)
- healthCheck - ECS replaces unhealthy containers automatically
- logConfiguration - Send logs to CloudWatch Logs
Step 2: Create ECS Service
import boto3
ecs = boto3.client('ecs')
response = ecs.create_service(
cluster='production',
serviceName='api-service',
taskDefinition='api-service:3',
desiredCount=3,
launchType='FARGATE',
networkConfiguration={
'awsvpcConfiguration': {
'subnets': ['subnet-private1', 'subnet-private2'],
'securityGroups': ['sg-api-tier'],
'assignPublicIp': 'DISABLED'
}
},
loadBalancers=[
{
'targetGroupArn': 'arn:aws:elasticloadbalancing:...',
'containerName': 'api',
'containerPort': 3000
}
],
deploymentConfiguration={
'minimumHealthyPercent': 100,
'maximumPercent': 200,
'deploymentCircuitBreaker': {
'enable': True,
'rollback': True
}
},
enableExecuteCommand=True # Enable ECS Exec for debugging
)
Deployment configuration:
- minimumHealthyPercent: 100 - Always keep all tasks healthy during deployment
- maximumPercent: 200 - Can temporarily run 2x tasks (rolling deployment)
- deploymentCircuitBreaker - Auto-rollback if deployment fails
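These two percentages define a task-count envelope that ECS stays within during a rolling deployment. Per the documented rounding rules (the lower bound rounds up, the upper bound rounds down), the envelope can be computed like this:

```python
import math

def deployment_bounds(desired: int, min_healthy_pct: int, max_pct: int):
    """Task-count envelope during an ECS rolling update.

    Lower bound rounds up, upper bound rounds down, per the ECS
    deploymentConfiguration documentation."""
    lower = math.ceil(desired * min_healthy_pct / 100)
    upper = math.floor(desired * max_pct / 100)
    return lower, upper

print(deployment_bounds(3, 100, 200))  # (3, 6): start a full new set first
print(deployment_bounds(3, 50, 150))   # (2, 4): replace roughly one at a time
```

With desiredCount=3 and 100/200, ECS can launch all three new tasks before stopping any old ones, which is what makes the deployment zero-downtime.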
Step 3: Configure Auto Scaling
scaling = boto3.client('application-autoscaling')
## Register scalable target
scaling.register_scalable_target(
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
MinCapacity=3,
MaxCapacity=20
)
## CPU-based scaling
scaling.put_scaling_policy(
PolicyName='cpu-scaling',
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 70.0,
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'ECSServiceAverageCPUUtilization'
},
'ScaleOutCooldown': 60,
'ScaleInCooldown': 300
}
)
## Memory-based scaling
scaling.put_scaling_policy(
PolicyName='memory-scaling',
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 80.0,
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'ECSServiceAverageMemoryUtilization'
},
'ScaleOutCooldown': 60,
'ScaleInCooldown': 300
}
)
## Request-count-based scaling (ALB metric)
scaling.put_scaling_policy(
PolicyName='request-count-scaling',
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 1000.0,
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'ALBRequestCountPerTarget',
'ResourceLabel': 'app/my-alb/xxx/targetgroup/my-tg/yyy'
},
'ScaleOutCooldown': 60,
'ScaleInCooldown': 300
}
)
This creates three scaling policies that work together. AWS uses the one requiring the most capacity at any moment.
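The "most capacity wins" rule can be expressed directly: given each policy's recommended task count, the service follows the maximum, so the hottest metric is never starved. The recommendation numbers below are made up for illustration:

```python
def effective_desired_count(recommendations: dict) -> int:
    """When multiple target-tracking policies are attached, Application
    Auto Scaling acts on whichever policy asks for the most capacity."""
    return max(recommendations.values())

recs = {
    'cpu-scaling': 6,            # CPU alone would want 6 tasks
    'memory-scaling': 4,         # memory alone would want 4
    'request-count-scaling': 9,  # request volume wants 9
}
print(effective_desired_count(recs))  # 9
```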
Example 2: Blue/Green Deployment with CodeDeploy
Blue/green deployments minimize downtime and enable instant rollback:
## appspec.yaml for CodeDeploy
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:us-east-1:123456789:task-definition/api-service:4"
        LoadBalancerInfo:
          ContainerName: "api"
          ContainerPort: 3000
        PlatformVersion: "LATEST"
Hooks:
  - BeforeInstall: "LambdaFunctionToValidateBeforeInstall"
  - AfterInstall: "LambdaFunctionToValidateAfterInstall"
  - AfterAllowTestTraffic: "LambdaFunctionToTestNewVersion"
  - BeforeAllowTraffic: "LambdaFunctionToValidateBeforeTrafficShift"
  - AfterAllowTraffic: "LambdaFunctionToValidateAfterTrafficShift"
Deployment flow:
BLUE/GREEN DEPLOYMENT FLOW

Step 1: BLUE (Current)
  ALB -> Target Group 1 -> Blue Tasks
  100% traffic

Step 2: GREEN (New) Provisioned
  ALB -> Target Group 1 -> Blue Tasks
         Target Group 2 -> Green Tasks
  100% -> Blue, 0% -> Green

Step 3: Test Traffic (Optional)
  Test listener routes to Green
  Run validation tests

Step 4: Traffic Shift
  ALB switches to Target Group 2
  0% -> Blue, 100% -> Green

Step 5: Cleanup (After wait)
  Terminate Blue tasks
  Green becomes the new Blue
Configure deployment in CodeDeploy:
codedeploy = boto3.client('codedeploy')
response = codedeploy.create_deployment_group(
applicationName='api-app',
deploymentGroupName='api-prod-dg',
deploymentConfigName='CodeDeployDefault.ECSCanary10Percent5Minutes',
serviceRoleArn='arn:aws:iam::123456789:role/CodeDeployRole',
ecsServices=[
{
'serviceName': 'api-service',
'clusterName': 'production'
}
],
loadBalancerInfo={
'targetGroupPairInfoList': [
{
'targetGroups': [
{'name': 'api-blue-tg'},
{'name': 'api-green-tg'}
],
'prodTrafficRoute': {
'listenerArns': ['arn:aws:elasticloadbalancing:...']
},
'testTrafficRoute': {
'listenerArns': ['arn:aws:elasticloadbalancing:...:8080']
}
}
]
},
blueGreenDeploymentConfiguration={
'terminateBlueInstancesOnDeploymentSuccess': {
'action': 'TERMINATE',
'terminationWaitTimeInMinutes': 5
},
'deploymentReadyOption': {
'actionOnTimeout': 'CONTINUE_DEPLOYMENT'
}
}
)
Deployment configurations:
- CodeDeployDefault.ECSLinear10PercentEvery1Minutes - Shifts 10% of traffic every minute
- CodeDeployDefault.ECSCanary10Percent5Minutes - Shifts 10% first, waits 5 minutes, then shifts the remaining 90%
- CodeDeployDefault.ECSAllAtOnce - Instant cutover
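These configurations trade speed for safety in how traffic reaches the green task set. A sketch of their timelines as (minute, percent-to-green) pairs follows; the shapes come from the config names, while exact timing, hooks, and rollback behavior remain controlled by CodeDeploy:

```python
def traffic_schedule(config: str) -> list[tuple[int, int]]:
    """Approximate traffic-shift timeline for the three CodeDeploy ECS
    deployment configurations listed above."""
    if config == 'ECSCanary10Percent5Minutes':
        return [(0, 10), (5, 100)]            # canary, then the rest
    if config == 'ECSLinear10PercentEvery1Minutes':
        return [(m, 10 * (m + 1)) for m in range(10)]  # 10% per minute
    if config == 'ECSAllAtOnce':
        return [(0, 100)]                     # instant cutover
    raise ValueError(f"unknown config: {config}")

print(traffic_schedule('ECSCanary10Percent5Minutes'))  # [(0, 10), (5, 100)]
```

The canary shape is usually the best default: 10% of real traffic exercises the new version while 90% of users stay on the known-good one, and a failed hook during the wait rolls everything back.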
Example 3: Multi-Container Task with Sidecar Pattern
The sidecar pattern places helper containers alongside your main application:
{
"family": "web-with-logging",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "web-app",
"image": "my-web-app:latest",
"cpu": 768,
"memory": 1536,
"essential": true,
"portMappings": [{"containerPort": 80}],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/web-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "web"
}
},
"dependsOn": [
{
"containerName": "log-router",
"condition": "START"
}
]
},
{
"name": "log-router",
"image": "fluent/fluentd:latest",
"cpu": 256,
"memory": 512,
"essential": false,
"environment": [
{"name": "FLUENTD_CONF", "value": "fluentd.conf"}
],
"mountPoints": [
{
"sourceVolume": "logs",
"containerPath": "/var/log/app"
}
]
}
],
"volumes": [
{
"name": "logs",
"host": {}
}
]
}
Key points:
- Main container marked essential: true - the task stops if it fails
- Sidecar marked essential: false - the task continues if it fails
- dependsOn ensures log-router starts before web-app
- Shared volume enables inter-container communication
Common sidecar use cases:
- Log aggregation (Fluentd, Fluent Bit)
- Service mesh proxy (Envoy, Linkerd)
- Secret management (Vault agent)
- Monitoring agents (Datadog, New Relic)
Example 4: EKS Deployment with kubectl
Deploy a microservice to EKS using Kubernetes manifests:
deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
  namespace: production
  labels:
    app: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api
        version: v1
    spec:
      serviceAccountName: api-service-account
      containers:
        - name: api
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          env:
            - name: ENV
              value: production
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: api-config
                  key: database.host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: db.password
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  sessionAffinity: None
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 4
          periodSeconds: 30
      selectPolicy: Max
Deploy with:
## Configure kubectl for EKS
aws eks update-kubeconfig --name production-cluster --region us-east-1
## Apply manifests
kubectl apply -f deployment.yaml
## Verify deployment
kubectl get deployments -n production
kubectl get pods -n production
kubectl get hpa -n production
## View logs
kubectl logs -f deployment/api-deployment -n production
## Check scaling events
kubectl describe hpa api-hpa -n production
Kubernetes advantages:
- Declarative configuration (desired state)
- Built-in service discovery (CoreDNS)
- Advanced scheduling (node affinity, taints/tolerations)
- Ecosystem tools (Helm, Istio, Prometheus)
Common Mistakes
⚠️ 1. Insufficient Resource Allocation
Problem: Tasks crash with "OutOfMemory" errors or get throttled.
❌ Wrong:
{
"cpu": "256",
"memory": "512",
"containerDefinitions": [{
"name": "app",
"memory": 512 // No headroom for JVM, buffers, etc.
}]
}
✅ Right:
{
"cpu": "512",
"memory": "1024",
"containerDefinitions": [{
"name": "app",
"memory": 768, // 75% of task memory
"memoryReservation": 512 // Soft limit
}]
}
💡 Tip: Monitor the MemoryUtilization CloudWatch metric. Add a 25-50% buffer over observed peak usage.
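That buffer rule can be turned into a small sizing helper that rounds the padded peak up to a task-size tier. The tier list below is illustrative and deliberately truncated (Fargate supports more sizes than shown):

```python
import math

# Illustrative/truncated list of Fargate task memory sizes, in MiB
FARGATE_MEMORY_TIERS = [512, 1024, 2048, 3072, 4096, 6144, 8192]

def recommend_memory(peak_mib: int, buffer: float = 0.5) -> int:
    """Pad observed peak memory by the given buffer (50% here, matching
    the upper end of the 25-50% rule), then round up to the next tier."""
    needed = math.ceil(peak_mib * (1 + buffer))
    for tier in FARGATE_MEMORY_TIERS:
        if tier >= needed:
            return tier
    raise ValueError("peak exceeds the tiers listed in this sketch")

print(recommend_memory(600))  # 600 * 1.5 = 900 -> next tier is 1024
```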
⚠️ 2. Missing Health Checks
Problem: ECS routes traffic to unhealthy containers, causing 5xx errors.
❌ Wrong:
{
"containerDefinitions": [{
"name": "app",
"portMappings": [{"containerPort": 80}]
// No healthCheck defined!
}]
}
✅ Right:
{
"containerDefinitions": [{
"name": "app",
"portMappings": [{"containerPort": 80}],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60 // Grace period for slow startup
}
}]
}
Plus configure ALB target group health check:
elbv2.modify_target_group(
TargetGroupArn='arn:...',
HealthCheckEnabled=True,
HealthCheckPath='/health',
HealthCheckIntervalSeconds=30,
HealthyThresholdCount=2,
UnhealthyThresholdCount=3
)
⚠️ 3. Not Using awsvpc Network Mode
Problem: Can't use security groups per task, limited networking features.
❌ Wrong:
{
"networkMode": "bridge", // Old default
"containerDefinitions": [{
"portMappings": [{"hostPort": 0, "containerPort": 80}] // Dynamic ports
}]
}
✅ Right:
{
"networkMode": "awsvpc",
"containerDefinitions": [{
"portMappings": [{"containerPort": 80}] // No hostPort needed
}]
}
Benefits of awsvpc:
- Task-specific security groups
- VPC Flow Logs per task
- Direct ENI attachment
- Required for Fargate
⚠️ 4. Hardcoded Secrets in Task Definitions
Problem: Secrets exposed in console, CloudTrail, version control.
❌ Wrong:
{
"environment": [
{"name": "DB_PASSWORD", "value": "MySuperSecretPassword123"} // NEVER!
]
}
✅ Right:
{
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password-AbCdEf"
}
],
"executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole" // Must have secretsmanager:GetSecretValue
}
Or use Systems Manager Parameter Store:
{
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/prod/db/password"
}
]
}
⚠️ 5. Inadequate Logging Configuration
Problem: Can't troubleshoot issues, no visibility into container behavior.
❌ Wrong:
{
"containerDefinitions": [{
"name": "app"
// No logConfiguration - logs go nowhere!
}]
}
✅ Right:
{
"containerDefinitions": [{
"name": "app",
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/my-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "app",
"awslogs-datetime-format": "%Y-%m-%d %H:%M:%S"
}
}
}]
}
Create log group first:
aws logs create-log-group --log-group-name /ecs/my-app
aws logs put-retention-policy --log-group-name /ecs/my-app --retention-in-days 7
⚠️ 6. Ignoring Deployment Configuration
Problem: All tasks replaced simultaneously, causing downtime.
❌ Wrong:
ecs.create_service(
desiredCount=10
# Using defaults: minimumHealthyPercent=100, maximumPercent=200
)
Relying on the defaults leaves you without a circuit breaker, so a failing deployment keeps retrying instead of rolling back automatically.
✅ Right:
ecs.create_service(
    desiredCount=10,
    deploymentConfiguration={
        'minimumHealthyPercent': 100,  # Never go below 10 tasks
        'maximumPercent': 150,  # Can temporarily run 15 tasks (50% overhead)
        'deploymentCircuitBreaker': {
            'enable': True,
            'rollback': True  # Auto-rollback on repeated failures
        }
    }
)
For zero-downtime deployments:
- minimumHealthyPercent: 100 ensures capacity is maintained
- maximumPercent: 200 allows a full new task set before terminating the old one
- The circuit breaker prevents bad deployments from completing
⚠️ 7. Not Using Capacity Providers
Problem: Manual cluster scaling, inefficient resource usage.
❌ Wrong:
## Manually scaling Auto Scaling Group
autoscaling.set_desired_capacity(
AutoScalingGroupName='ecs-cluster-asg',
DesiredCapacity=10 # Guessing capacity needs
)
✅ Right:
## Create capacity provider
ecs.create_capacity_provider(
name='my-capacity-provider',
autoScalingGroupProvider={
'autoScalingGroupArn': 'arn:aws:autoscaling:...',
'managedScaling': {
'status': 'ENABLED',
'targetCapacity': 80,  # Keep the cluster at 80% utilization
'minimumScalingStepSize': 1,
'maximumScalingStepSize': 10
},
'managedTerminationProtection': 'ENABLED'
}
)
## Associate with cluster
ecs.put_cluster_capacity_providers(
cluster='production',
capacityProviders=['my-capacity-provider'],
defaultCapacityProviderStrategy=[{
'capacityProvider': 'my-capacity-provider',
'weight': 1,
'base': 2  # Always run at least 2 tasks on this provider
}]
)
Capacity providers automatically scale cluster to meet task demands.
⚠️ 8. Incorrect Service Discovery Configuration
Problem: Services can't find each other, hardcoded IPs break on updates.
❌ Wrong:
## Hardcoding service endpoints
os.environ['API_URL'] = 'http://10.0.1.45:8080' # IP changes on deployment!
✅ Right:
## Use Cloud Map service discovery
servicediscovery.create_service(
Name='api-service',
NamespaceId='ns-xxx',
DnsConfig={
'DnsRecords': [{'Type': 'A', 'TTL': 60}],
'RoutingPolicy': 'MULTIVALUE'
},
HealthCheckCustomConfig={'FailureThreshold': 1}
)
## Configure ECS service
ecs.create_service(
serviceName='api',
serviceRegistries=[{
'registryArn': 'arn:aws:servicediscovery:...:service/srv-xxx'
}]
)
## Now reference by name
os.environ['API_URL'] = 'http://api-service.internal.example.com'
Key Takeaways
🎯 Container Orchestration Essentials:
Choose the right service:
- ECS for AWS-native simplicity
- EKS for Kubernetes portability
- Fargate for serverless operation
Task definitions are blueprints - They define:
- Container images and versions
- Resource allocations (CPU/memory)
- Networking configuration
- IAM roles and permissions
- Environment variables and secrets
Services maintain desired state - They:
- Keep specified number of tasks running
- Integrate with load balancers
- Handle rolling deployments
- Auto-replace failed tasks
Networking matters - Use awsvpc mode for:
- Task-level security groups
- Enhanced monitoring
- Fargate compatibility
- Production workloads
Implement health checks everywhere:
- Container health checks (task-level)
- Target group health checks (ALB-level)
- Application health endpoints
Auto-scale intelligently:
- Target tracking for CPU/memory
- Request-count based for user traffic
- Scheduled scaling for predictable patterns
- Capacity providers for cluster scaling
Security best practices:
- Store secrets in Secrets Manager/Parameter Store
- Use task-specific IAM roles
- Run containers as non-root users
- Scan images for vulnerabilities
Deployment strategies reduce risk:
- Rolling updates for gradual migration
- Blue/green for instant rollback
- Circuit breakers for auto-rollback
- Canary deployments for testing
Observability is critical:
- CloudWatch Logs for application logs
- CloudWatch Container Insights for metrics
- AWS X-Ray for distributed tracing
- ECS Exec for debugging
Cost optimization techniques:
- Use Fargate Spot for fault-tolerant workloads
- Right-size task resources
- Implement auto-scaling
- Use Savings Plans for predictable workloads
Quick Reference Card
| Term | Definition |
|---|---|
| ECS Cluster | Logical grouping of container instances |
| Task Definition | JSON blueprint for containers |
| Task | Running instance of task definition |
| Service | Maintains desired task count + load balancing |
| Fargate | Serverless compute for containers |
| awsvpc | Network mode giving each task its own ENI |
| Cloud Map | Service discovery via DNS |
| Capacity Provider | Auto-scales cluster infrastructure |
| Target Tracking | Maintains metric at target value |
| Blue/Green | Deployment with instant rollback capability |
Further Study
Official AWS Documentation:
- Amazon ECS Developer Guide - Comprehensive ECS documentation
- Amazon EKS User Guide - Complete EKS reference
- AWS Fargate User Guide - Fargate-specific configuration
Best Practices:
- ECS Best Practices Guide - Official best practices
- AWS Architecture Center - Containers - Reference architectures
Continue your AWS journey by exploring service mesh architectures with AWS App Mesh, GitOps workflows with AWS CodePipeline, and cost optimization strategies for containerized workloads!