
Container Orchestration

ECS vs EKS vs Fargate comparison, when to use each, and container deployment strategies

Container Orchestration on AWS

Master container orchestration with free flashcards and spaced repetition practice. This lesson covers Amazon ECS, EKS, Fargate, service discovery, and scaling strategies: essential concepts for building production-ready containerized applications on AWS.

Welcome to Container Orchestration

πŸ’» Container orchestration transforms how we deploy and manage applications at scale. While Docker packages your application into containers, orchestration platforms like Amazon ECS (Elastic Container Service) and Amazon EKS (Elastic Kubernetes Service) handle the complex tasks of scheduling, scaling, health monitoring, and service discovery across fleets of containers.

Think of containers as shipping containers and orchestration as the port management system. Just as ports coordinate where containers go, how they're loaded onto ships, and when they arrive, container orchestration platforms coordinate where your application containers run, how they scale, and how they communicate.

πŸ” Why Container Orchestration Matters:

  • Automated deployment - Deploy hundreds of containers with a single command
  • Self-healing - Automatically replace failed containers
  • Dynamic scaling - Add/remove containers based on demand
  • Service discovery - Containers find each other automatically
  • Rolling updates - Update applications with zero downtime
  • Resource optimization - Pack containers efficiently across hosts

Core Concepts

Container Orchestration Fundamentals

Container orchestration is the automated management of containerized application lifecycles. On AWS, you have three primary options:

Service       Best For                Control Level   Learning Curve
Amazon ECS    AWS-native apps         High            Low
Amazon EKS    Kubernetes workloads    Very High       High
AWS Fargate   Serverless containers   Low             Very Low

Amazon ECS Architecture

Amazon ECS is AWS's proprietary container orchestration service. It uses familiar AWS concepts and integrates seamlessly with other AWS services.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           ECS ARCHITECTURE                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚   CLUSTER    β”‚         β”‚   SERVICE    β”‚    β”‚
β”‚  β”‚  (Logical    │─────────│  (Desired    β”‚    β”‚
β”‚  β”‚   Group)     β”‚         β”‚   State)     β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚         β”‚                        β”‚            β”‚
β”‚         β–Ό                        β–Ό            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  CONTAINER   │◀────────│    TASK      β”‚    β”‚
β”‚  β”‚  INSTANCES   β”‚         β”‚ (Container   β”‚    β”‚
β”‚  β”‚  (EC2/       β”‚         β”‚  Definition) β”‚    β”‚
β”‚  β”‚   Fargate)   β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β”‚         β”‚                                     β”‚
β”‚         β–Ό                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚
β”‚  β”‚  CONTAINERS  β”‚                             β”‚
β”‚  β”‚  🐳 🐳 🐳    β”‚                             β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key ECS Components:

  1. Cluster - Logical grouping of container instances (EC2 or Fargate)
  2. Task Definition - Blueprint describing your containers (similar to a docker-compose file for ECS)
  3. Task - Running instance of a task definition
  4. Service - Maintains desired number of tasks, handles load balancing
  5. Container Instance - EC2 instance running the ECS agent (not needed with Fargate)

πŸ’‘ Memory Aid - CTTSC: Cluster holds Task definitions that run Tasks via Services on Container instances.
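
To see how these pieces relate in practice, here is a minimal boto3 sketch (the cluster name is hypothetical) that walks a cluster's services and counts the tasks each one keeps running:

import boto3

ecs = boto3.client('ecs')

# Walk the ECS hierarchy: cluster -> services -> tasks
cluster = 'production'
for service_arn in ecs.list_services(cluster=cluster)['serviceArns']:
    service_name = service_arn.split('/')[-1]
    task_arns = ecs.list_tasks(cluster=cluster, serviceName=service_name)['taskArns']
    print(f'{service_name}: {len(task_arns)} running task(s)')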

Task Definitions

A task definition is a JSON blueprint that describes:

  • Which container images to use
  • CPU and memory requirements
  • Networking mode
  • IAM roles
  • Environment variables
  • Volume mounts

Here's a basic task definition structure:

{
  "family": "web-app",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "web-container",
      "image": "nginx:latest",
      "portMappings": [
        {
          "containerPort": 80,
          "protocol": "tcp"
        }
      ],
      "essential": true,
      "environment": [
        {
          "name": "ENV",
          "value": "production"
        }
      ]
    }
  ]
}

Important fields:

  • family - Groups related task definition versions
  • networkMode - How containers communicate (awsvpc gives each task its own ENI)
  • requiresCompatibilities - EC2, Fargate, or both
  • cpu/memory - Resource allocations (CPU units, where 1024 = 1 vCPU, and memory in MiB; Fargate allows only specific combinations)
  • essential - If true, task stops when this container stops
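
Registering the blueprint is a separate step; each registration creates a new revision (web-app:1, web-app:2, and so on). A minimal boto3 sketch, assuming the JSON above is saved to a local file (the file name is illustrative):

import json
import boto3

ecs = boto3.client('ecs')

# Load the JSON blueprint above and register it as a new revision
with open('web-app-task-definition.json') as f:
    task_def = json.load(f)

response = ecs.register_task_definition(**task_def)
print(response['taskDefinition']['taskDefinitionArn'])  # e.g. ...:task-definition/web-app:1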

Amazon EKS (Elastic Kubernetes Service)

Amazon EKS runs upstream Kubernetes on AWS, giving you full Kubernetes compatibility. AWS manages the control plane (API server, etcd, scheduler, controllers) while you run worker nodes in your own VPC, either self-managed, in managed node groups, or on Fargate.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           EKS ARCHITECTURE                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚  β”‚    CONTROL PLANE           β”‚               β”‚
β”‚  β”‚  (AWS Managed)             β”‚               β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚               β”‚
β”‚  β”‚  β”‚ API      β”‚              β”‚               β”‚
β”‚  β”‚  β”‚ Server   β”‚              β”‚               β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜              β”‚               β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”              β”‚               β”‚
β”‚  β”‚  β”‚ etcd     β”‚              β”‚               β”‚
β”‚  β”‚  β”‚ Schedulerβ”‚              β”‚               β”‚
β”‚  β”‚  β”‚ Controller              β”‚               β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚               β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚             β”‚                                 β”‚
β”‚             β–Ό                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚
β”‚  β”‚    WORKER NODES (Your VPC)   β”‚             β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”             β”‚
β”‚  β”‚  β”‚ Pod  β”‚  β”‚ Pod  β”‚  β”‚ Pod  β”‚             β”‚
β”‚  β”‚  β”‚ 🐳🐳 β”‚  β”‚ 🐳🐳 β”‚  β”‚ 🐳🐳 β”‚             β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”˜             β”‚
β”‚  β”‚   Node 1    Node 2    Node 3              β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Kubernetes vs ECS Terminology:

ECS Term          Kubernetes Term        Description
Task              Pod                    Group of containers running together
Service           Deployment + Service   Manages replicas and exposes them
Task Definition   Pod Spec               Container configuration blueprint
Cluster           Cluster                Group of compute resources
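
Cluster creation itself is a single API call (or an eksctl/Terraform run). A rough boto3 sketch, assuming an existing IAM cluster role, node role, and VPC subnets (all ARNs, IDs, and the Kubernetes version below are placeholders):

import boto3

eks = boto3.client('eks')

# Create the managed control plane
eks.create_cluster(
    name='production-cluster',
    version='1.29',
    roleArn='arn:aws:iam::123456789:role/eksClusterRole',
    resourcesVpcConfig={
        'subnetIds': ['subnet-12345', 'subnet-67890'],
        'securityGroupIds': ['sg-12345']
    }
)
eks.get_waiter('cluster_active').wait(name='production-cluster')

# Add worker nodes as a managed node group
eks.create_nodegroup(
    clusterName='production-cluster',
    nodegroupName='default-workers',
    nodeRole='arn:aws:iam::123456789:role/eksNodeRole',
    subnets=['subnet-12345', 'subnet-67890'],
    scalingConfig={'minSize': 2, 'maxSize': 6, 'desiredSize': 3}
)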

AWS Fargate: Serverless Containers

AWS Fargate is a serverless compute engine for containers. You don't manage EC2 instancesβ€”just define your containers and Fargate handles the infrastructure.

🎯 Key Benefits:

  • No server management - AWS provisions and scales compute
  • Pay per use - Charged only for vCPU and memory consumed
  • Better security - Task-level isolation, each task has its own kernel
  • Right-sizing - Precise resource allocation per task

Fargate Launch Type vs EC2 Launch Type:

Aspect              Fargate                            EC2
Server Management   βœ… Fully managed                   ❌ You manage EC2 instances
Scaling             βœ… Instant, per-task               ⚠️ Must scale instances first
Pricing             Per vCPU-second + memory           EC2 instance pricing
Use Case            Variable workloads, quick starts   Consistent workloads, cost optimization
Control             Limited (no host access)           Full (SSH, custom AMIs)
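
The launch type is chosen per task or per service. A minimal sketch of running a one-off task on Fargate (cluster, task definition, and network IDs are placeholders):

import boto3

ecs = boto3.client('ecs')

# Run a one-off task on Fargate: no instances to provision or scale first
ecs.run_task(
    cluster='production',
    launchType='FARGATE',    # switch to 'EC2' to place it on container instances
    taskDefinition='web-app',
    count=1,
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-12345'],
            'securityGroups': ['sg-12345'],
            'assignPublicIp': 'ENABLED'
        }
    }
)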

Service Discovery

Service discovery enables containers to find and communicate with each other automatically. AWS provides two mechanisms:

1. AWS Cloud Map

Cloud Map creates a service registry integrated with Route 53:

import boto3

servicediscovery = boto3.client('servicediscovery')

## Create namespace
namespace = servicediscovery.create_private_dns_namespace(
    Name='internal.example.com',
    Vpc='vpc-12345678'
)

## Create service (NamespaceId comes from the namespace created above,
## available once its create operation completes)
service = servicediscovery.create_service(
    Name='web-api',
    NamespaceId='ns-xxx',
    DnsConfig={
        'DnsRecords': [{'Type': 'A', 'TTL': 60}]
    },
    HealthCheckCustomConfig={'FailureThreshold': 1}
)

Now containers can call web-api.internal.example.com and Cloud Map routes to healthy instances.

2. Application Load Balancer (ALB) with Target Groups

ALB can route to ECS tasks dynamically:

{
  "loadBalancers": [
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:...",
      "containerName": "web-container",
      "containerPort": 80
    }
  ],
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-12345", "subnet-67890"],
      "securityGroups": ["sg-12345"],
      "assignPublicIp": "ENABLED"
    }
  }
}

ECS automatically registers/deregisters tasks from the target group.
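
One detail worth noting: because awsvpc tasks register by IP address, the target group must use target type ip. A sketch of creating one (names and IDs are placeholders):

import boto3

elbv2 = boto3.client('elbv2')

# Target group for awsvpc/Fargate tasks: targets are IPs, not instances
elbv2.create_target_group(
    Name='web-api-tg',
    Protocol='HTTP',
    Port=80,
    VpcId='vpc-12345678',
    TargetType='ip',
    HealthCheckPath='/health',
    HealthCheckIntervalSeconds=30
)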

Auto Scaling Strategies

Container orchestration enables sophisticated scaling patterns:

1. Service Auto Scaling (Task Level)

Target Tracking Scaling - Maintain a metric at target value:

{
  "ServiceName": "web-service",
  "ScalableDimension": "ecs:service:DesiredCount",
  "PolicyName": "cpu-target-tracking",
  "PolicyType": "TargetTrackingScaling",
  "TargetTrackingScalingPolicyConfiguration": {
    "TargetValue": 75.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }
}

This maintains average CPU at 75% by adding/removing tasks.

2. Cluster Auto Scaling (Infrastructure Level)

Capacity Providers manage cluster capacity automatically:

import boto3

ecs = boto3.client('ecs')

ecs.put_cluster_capacity_providers(
    cluster='production',
    capacityProviders=['FARGATE', 'my-capacity-provider'],
    defaultCapacityProviderStrategy=[
        {
            'capacityProvider': 'FARGATE',
            'weight': 1,
            'base': 2
        }
    ]
)

Capacity provider strategy:

  • base - Minimum tasks using this provider
  • weight - Relative distribution (e.g., 3:1 ratio between providers)
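
For example, a service can mix on-demand Fargate with Fargate Spot. In the hypothetical strategy below (it assumes both providers are associated with the cluster), the first 2 tasks always run on FARGATE and the remainder are split 3:1 in favor of FARGATE_SPOT:

import boto3

ecs = boto3.client('ecs')

ecs.create_service(
    cluster='production',
    serviceName='worker-service',
    taskDefinition='worker',
    desiredCount=10,
    capacityProviderStrategy=[
        {'capacityProvider': 'FARGATE', 'weight': 1, 'base': 2},
        {'capacityProvider': 'FARGATE_SPOT', 'weight': 3}
    ],
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-12345'],
            'securityGroups': ['sg-12345']
        }
    }
)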

3. Scheduled Scaling

Scale based on predictable patterns:

scaling = boto3.client('application-autoscaling')

## Scale up for business hours
scaling.put_scheduled_action(
    ServiceNamespace='ecs',
    ScheduledActionName='scale-up-morning',
    ResourceId='service/production/web-app',
    ScalableDimension='ecs:service:DesiredCount',
    Schedule='cron(0 8 * * ? *)',
    ScalableTargetAction={'MinCapacity': 10, 'MaxCapacity': 50}
)

## Scale down after hours
scaling.put_scheduled_action(
    ServiceNamespace='ecs',
    ScheduledActionName='scale-down-evening',
    ResourceId='service/production/web-app',
    ScalableDimension='ecs:service:DesiredCount',
    Schedule='cron(0 20 * * ? *)',
    ScalableTargetAction={'MinCapacity': 2, 'MaxCapacity': 10}
)

SCALING DECISION FLOW

    πŸ“Š Metric Threshold Reached
           |
           β–Ό
    ⏱️ Cooldown Period Over?
           |
      β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
      β–Ό         β–Ό
    βœ… YES    ❌ NO β†’ Wait
      |         
      β–Ό         
    πŸ”Ί Scale Action Triggered
      |
      β–Ό
    βš™οΈ Provision/Terminate Tasks
      |
      β–Ό
    πŸ”„ Update Desired Count
      |
      β–Ό
    ⏱️ Start Cooldown Timer

πŸ’‘ Scaling Best Practice: Set scale-out cooldown low (60s) but scale-in cooldown high (300s+). This allows quick response to load increases but prevents flapping during decreases.

Task Networking Modes

ECS supports multiple networking modes:

Mode     How It Works                                 Use Case                                  Fargate Support
awsvpc   Each task gets its own ENI with private IP   Microservices, security groups per task   βœ… Required
bridge   Docker bridge on host, port mapping          Simple apps, port conflicts OK            ❌ No
host     Direct host network, no isolation            Maximum performance, no port conflicts    ❌ No
none     No external networking                       Batch jobs, local processing              ❌ No

awsvpc mode is recommended for production. networkMode is set in the task definition, while the subnets and security groups are supplied via networkConfiguration when you create the service or run the task:

{
  "networkMode": "awsvpc",
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-abc123", "subnet-def456"],
      "securityGroups": ["sg-web-tier"],
      "assignPublicIp": "DISABLED"
    }
  }
}

Benefits:

  • Task-level security groups
  • VPC Flow Logs per task
  • Direct integration with VPC routing
  • Required for Fargate

⚠️ ENI Limit Warning: Each awsvpc task consumes one ENI. Check EC2 instance ENI limits when sizing.
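
You can look up those limits programmatically before sizing a cluster; a small sketch (the instance type is just an example, and ECS ENI trunking can raise the effective task limit on supported instance types):

import boto3

ec2 = boto3.client('ec2')

# How many ENIs (and therefore awsvpc tasks) can one instance type hold?
info = ec2.describe_instance_types(InstanceTypes=['m5.large'])
net = info['InstanceTypes'][0]['NetworkInfo']
print('Max ENIs:', net['MaximumNetworkInterfaces'])
print('IPv4 addresses per ENI:', net['Ipv4AddressesPerInterface'])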

Detailed Examples

Example 1: Deploying a Microservice with ECS and Fargate

Let's deploy a Node.js API with automatic scaling:

Step 1: Create Task Definition

{
  "family": "api-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789:role/apiTaskRole",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
      "cpu": 512,
      "memory": 1024,
      "essential": true,
      "portMappings": [
        {
          "containerPort": 3000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "NODE_ENV", "value": "production"},
        {"name": "PORT", "value": "3000"}
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "api"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Key elements explained:

  • executionRoleArn - IAM role for ECS agent (pull images, write logs)
  • taskRoleArn - IAM role for your application code (access AWS services)
  • secrets - Inject from Secrets Manager (never hardcode passwords)
  • healthCheck - ECS replaces unhealthy containers automatically
  • logConfiguration - Send logs to CloudWatch Logs

Step 2: Create ECS Service

import boto3

ecs = boto3.client('ecs')

response = ecs.create_service(
    cluster='production',
    serviceName='api-service',
    taskDefinition='api-service:3',
    desiredCount=3,
    launchType='FARGATE',
    networkConfiguration={
        'awsvpcConfiguration': {
            'subnets': ['subnet-private1', 'subnet-private2'],
            'securityGroups': ['sg-api-tier'],
            'assignPublicIp': 'DISABLED'
        }
    },
    loadBalancers=[
        {
            'targetGroupArn': 'arn:aws:elasticloadbalancing:...',
            'containerName': 'api',
            'containerPort': 3000
        }
    ],
    deploymentConfiguration={
        'minimumHealthyPercent': 100,
        'maximumPercent': 200,
        'deploymentCircuitBreaker': {
            'enable': True,
            'rollback': True
        }
    },
    enableExecuteCommand=True  # Enable ECS Exec for debugging
)

Deployment configuration:

  • minimumHealthyPercent: 100 - Always keep all tasks healthy during deployment
  • maximumPercent: 200 - Can temporarily run 2x tasks (rolling deployment)
  • deploymentCircuitBreaker - Auto-rollback if deployment fails

Step 3: Configure Auto Scaling

scaling = boto3.client('application-autoscaling')

## Register scalable target
scaling.register_scalable_target(
    ServiceNamespace='ecs',
    ResourceId='service/production/api-service',
    ScalableDimension='ecs:service:DesiredCount',
    MinCapacity=3,
    MaxCapacity=20
)

## CPU-based scaling
scaling.put_scaling_policy(
    PolicyName='cpu-scaling',
    ServiceNamespace='ecs',
    ResourceId='service/production/api-service',
    ScalableDimension='ecs:service:DesiredCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ECSServiceAverageCPUUtilization'
        },
        'ScaleOutCooldown': 60,
        'ScaleInCooldown': 300
    }
)

## Memory-based scaling
scaling.put_scaling_policy(
    PolicyName='memory-scaling',
    ServiceNamespace='ecs',
    ResourceId='service/production/api-service',
    ScalableDimension='ecs:service:DesiredCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 80.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ECSServiceAverageMemoryUtilization'
        },
        'ScaleOutCooldown': 60,
        'ScaleInCooldown': 300
    }
)

## Request-count-based scaling (ALB metric)
scaling.put_scaling_policy(
    PolicyName='request-count-scaling',
    ServiceNamespace='ecs',
    ResourceId='service/production/api-service',
    ScalableDimension='ecs:service:DesiredCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'ALBRequestCountPerTarget',
            'ResourceLabel': 'app/my-alb/xxx/targetgroup/my-tg/yyy'
        },
        'ScaleOutCooldown': 60,
        'ScaleInCooldown': 300
    }
)

This creates three scaling policies that work together: the service scales out whenever any policy calls for more capacity (the highest requested count wins), and scales in only when all policies agree it can.

Example 2: Blue/Green Deployment with CodeDeploy

Blue/green deployments minimize downtime and enable instant rollback:

## appspec.yaml for CodeDeploy
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:us-east-1:123456789:task-definition/api-service:4"
        LoadBalancerInfo:
          ContainerName: "api"
          ContainerPort: 3000
        PlatformVersion: "LATEST"
Hooks:
  - BeforeInstall: "LambdaFunctionToValidateBeforeInstall"
  - AfterInstall: "LambdaFunctionToValidateAfterInstall"
  - AfterAllowTestTraffic: "LambdaFunctionToTestNewVersion"
  - BeforeAllowTraffic: "LambdaFunctionToValidateBeforeTrafficShift"
  - AfterAllowTraffic: "LambdaFunctionToValidateAfterTrafficShift"

Deployment flow:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚       BLUE/GREEN DEPLOYMENT FLOW             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                              β”‚
β”‚  Step 1: πŸ”΅ BLUE (Current)                  β”‚
β”‚          ALB β†’ Target Group 1 β†’ Blue Tasks  β”‚
β”‚          100% traffic                        β”‚
β”‚                                              β”‚
β”‚  Step 2: 🟒 GREEN (New) Provisioned         β”‚
β”‚          ALB β†’ Target Group 1 β†’ Blue Tasks  β”‚
β”‚                 Target Group 2 β†’ Green Tasksβ”‚
β”‚          100% β†’ Blue, 0% β†’ Green            β”‚
β”‚                                              β”‚
β”‚  Step 3: πŸ” Test Traffic (Optional)         β”‚
β”‚          Test listener routes to Green      β”‚
β”‚          Run validation tests               β”‚
β”‚                                              β”‚
β”‚  Step 4: ⚑ Traffic Shift                   β”‚
β”‚          ALB switches to Target Group 2     β”‚
β”‚          0% β†’ Blue, 100% β†’ Green            β”‚
β”‚                                              β”‚
β”‚  Step 5: πŸ—‘οΈ Cleanup (After wait)           β”‚
β”‚          Terminate Blue tasks               β”‚
β”‚          Green becomes new Blue             β”‚
β”‚                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Configure deployment in CodeDeploy:

codedeploy = boto3.client('codedeploy')

response = codedeploy.create_deployment_group(
    applicationName='api-app',
    deploymentGroupName='api-prod-dg',
    deploymentConfigName='CodeDeployDefault.ECSCanary10Percent5Minutes',
    serviceRoleArn='arn:aws:iam::123456789:role/CodeDeployRole',
    deploymentStyle={
        'deploymentType': 'BLUE_GREEN',
        'deploymentOption': 'WITH_TRAFFIC_CONTROL'
    },
    ecsServices=[
        {
            'serviceName': 'api-service',
            'clusterName': 'production'
        }
    ],
    loadBalancerInfo={
        'targetGroupPairInfoList': [
            {
                'targetGroups': [
                    {'name': 'api-blue-tg'},
                    {'name': 'api-green-tg'}
                ],
                'prodTrafficRoute': {
                    'listenerArns': ['arn:aws:elasticloadbalancing:...']
                },
                'testTrafficRoute': {
                    'listenerArns': ['arn:aws:elasticloadbalancing:...:8080']
                }
            }
        ]
    },
    blueGreenDeploymentConfiguration={
        'terminateBlueInstancesOnDeploymentSuccess': {
            'action': 'TERMINATE',
            'terminationWaitTimeInMinutes': 5
        },
        'deploymentReadyOption': {
            'actionOnTimeout': 'CONTINUE_DEPLOYMENT'
        }
    }
)

Deployment configurations:

  • CodeDeployDefault.ECSLinear10PercentEvery1Minutes - 10% every minute
  • CodeDeployDefault.ECSCanary10Percent5Minutes - 10% first, wait 5min, then 90%
  • CodeDeployDefault.ECSAllAtOnce - Instant cutover
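
With the deployment group in place, a deployment is started by handing CodeDeploy the appspec content. A minimal sketch, reusing the hypothetical names above:

import boto3

codedeploy = boto3.client('codedeploy')

# Read the appspec shown earlier and start a blue/green deployment
with open('appspec.yaml') as f:
    appspec_content = f.read()

codedeploy.create_deployment(
    applicationName='api-app',
    deploymentGroupName='api-prod-dg',
    revision={
        'revisionType': 'AppSpecContent',
        'appSpecContent': {'content': appspec_content}
    }
)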

Example 3: Multi-Container Task with Sidecar Pattern

The sidecar pattern places helper containers alongside your main application:

{
  "family": "web-with-logging",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "web-app",
      "image": "my-web-app:latest",
      "cpu": 768,
      "memory": 1536,
      "essential": true,
      "portMappings": [{"containerPort": 80}],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/web-app",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "web"
        }
      },
      "dependsOn": [
        {
          "containerName": "log-router",
          "condition": "START"
        }
      ]
    },
    {
      "name": "log-router",
      "image": "fluent/fluentd:latest",
      "cpu": 256,
      "memory": 512,
      "essential": false,
      "environment": [
        {"name": "FLUENTD_CONF", "value": "fluentd.conf"}
      ],
      "mountPoints": [
        {
          "sourceVolume": "logs",
          "containerPath": "/var/log/app"
        }
      ]
    }
  ],
  "volumes": [
    {
      "name": "logs",
      "host": {}
    }
  ]
}

Key points:

  • Main container marked essential: true - task stops if it fails
  • Sidecar marked essential: false - task continues if it fails
  • dependsOn ensures log-router starts before web-app
  • Shared volume for inter-container communication

Common sidecar use cases:

  • Log aggregation (Fluentd, Fluent Bit)
  • Service mesh proxy (Envoy, Linkerd)
  • Secret management (Vault agent)
  • Monitoring agents (Datadog, New Relic)

Example 4: EKS Deployment with kubectl

Deploy a microservice to EKS using Kubernetes manifests:

deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
  namespace: production
  labels:
    app: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api
        version: v1
    spec:
      serviceAccountName: api-service-account
      containers:
      - name: api
        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        env:
        - name: ENV
          value: production
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              name: api-config
              key: database.host
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: api-secrets
              key: db.password
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
  sessionAffinity: None
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 4
        periodSeconds: 30
      selectPolicy: Max

Deploy with:

## Configure kubectl for EKS
aws eks update-kubeconfig --name production-cluster --region us-east-1

## Apply manifests
kubectl apply -f deployment.yaml

## Verify deployment
kubectl get deployments -n production
kubectl get pods -n production
kubectl get hpa -n production

## View logs
kubectl logs -f deployment/api-deployment -n production

## Check scaling events
kubectl describe hpa api-hpa -n production

Kubernetes advantages:

  • Declarative configuration (desired state)
  • Built-in service discovery (CoreDNS)
  • Advanced scheduling (node affinity, taints/tolerations)
  • Ecosystem tools (Helm, Istio, Prometheus)

Common Mistakes

⚠️ 1. Insufficient Resource Allocation

Problem: Tasks crash with "OutOfMemory" errors or get throttled.

❌ Wrong:

{
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [{
    "name": "app",
    "memory": 512  // No headroom for JVM, buffers, etc.
  }]
}

βœ… Right:

{
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [{
    "name": "app",
    "memory": 768,  // 75% of task memory
    "memoryReservation": 512  // Soft limit
  }]
}

πŸ’‘ Tip: Monitor MemoryUtilization CloudWatch metric. Add 25-50% buffer over observed peak usage.
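
A quick way to find that peak is to query CloudWatch directly; a sketch using placeholder cluster and service names:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Peak service memory utilization over the last 24 hours
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/ECS',
    MetricName='MemoryUtilization',
    Dimensions=[
        {'Name': 'ClusterName', 'Value': 'production'},
        {'Name': 'ServiceName', 'Value': 'api-service'}
    ],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Maximum']
)
peak = max((p['Maximum'] for p in stats['Datapoints']), default=0)
print(f'Peak memory utilization: {peak:.1f}%')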

⚠️ 2. Missing Health Checks

Problem: ECS routes traffic to unhealthy containers, causing 5xx errors.

❌ Wrong:

{
  "containerDefinitions": [{
    "name": "app",
    "portMappings": [{"containerPort": 80}]
    // No healthCheck defined!
  }]
}

βœ… Right:

{
  "containerDefinitions": [{
    "name": "app",
    "portMappings": [{"containerPort": 80}],
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
      "interval": 30,
      "timeout": 5,
      "retries": 3,
      "startPeriod": 60  // Grace period for slow startup
    }
  }]
}

Plus configure ALB target group health check:

import boto3

elbv2 = boto3.client('elbv2')

elbv2.modify_target_group(
    TargetGroupArn='arn:...',
    HealthCheckEnabled=True,
    HealthCheckPath='/health',
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=3
)

⚠️ 3. Not Using awsvpc Network Mode

Problem: Can't use security groups per task, limited networking features.

❌ Wrong:

{
  "networkMode": "bridge",  // Old default
  "containerDefinitions": [{
    "portMappings": [{"hostPort": 0, "containerPort": 80}]  // Dynamic ports
  }]
}

βœ… Right:

{
  "networkMode": "awsvpc",
  "containerDefinitions": [{
    "portMappings": [{"containerPort": 80}]  // No hostPort needed
  }]
}

Benefits of awsvpc:

  • Task-specific security groups
  • VPC Flow Logs per task
  • Direct ENI attachment
  • Required for Fargate

⚠️ 4. Hardcoded Secrets in Task Definitions

Problem: Secrets exposed in console, CloudTrail, version control.

❌ Wrong:

{
  "environment": [
    {"name": "DB_PASSWORD", "value": "MySuperSecretPassword123"}  // NEVER!
  ]
}

βœ… Right:

{
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password-AbCdEf"
    }
  ],
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole"  // Must have secretsmanager:GetSecretValue
}

Or use Systems Manager Parameter Store:

{
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/prod/db/password"
    }
  ]
}
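
Either way, the task execution role needs permission to read those values at container start. A hedged sketch of attaching an inline policy (role name, policy name, and ARNs are placeholders):

import json
import boto3

iam = boto3.client('iam')

# Allow the execution role to fetch the referenced secrets/parameters
policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': ['secretsmanager:GetSecretValue', 'ssm:GetParameters'],
        'Resource': [
            'arn:aws:secretsmanager:us-east-1:123456789:secret:db-password-*',
            'arn:aws:ssm:us-east-1:123456789:parameter/prod/db/*'
        ]
    }]
}

iam.put_role_policy(
    RoleName='ecsTaskExecutionRole',
    PolicyName='read-app-secrets',
    PolicyDocument=json.dumps(policy)
)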

⚠️ 5. Inadequate Logging Configuration

Problem: Can't troubleshoot issues, no visibility into container behavior.

❌ Wrong:

{
  "containerDefinitions": [{
    "name": "app"
    // No logConfiguration - logs go nowhere!
  }]
}

βœ… Right:

{
  "containerDefinitions": [{
    "name": "app",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/my-app",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "app",
        "awslogs-datetime-format": "%Y-%m-%d %H:%M:%S"
      }
    }
  }]
}

Create log group first:

aws logs create-log-group --log-group-name /ecs/my-app
aws logs put-retention-policy --log-group-name /ecs/my-app --retention-in-days 7

⚠️ 6. Ignoring Deployment Configuration

Problem: Deployments cause downtime, or a failing release keeps retrying and never rolls back.

❌ Wrong:

ecs.create_service(
    desiredCount=10
    # Using defaults: minimumHealthyPercent=100, maximumPercent=200
)

The defaults keep capacity up during a rolling update, but without a circuit breaker a failing deployment keeps launching broken tasks and never rolls back.

βœ… Right:

ecs.create_service(
    desiredCount=10,
    deploymentConfiguration={
        'minimumHealthyPercent': 100,  # Never go below 10 tasks
        'maximumPercent': 150,  # Can temporarily run 15 tasks (50% overhead)
        'deploymentCircuitBreaker': {
            'enable': True,
            'rollback': True  # Auto-rollback on repeated failures
        }
    }
)

For zero-downtime deployments:

  • minimumHealthyPercent: 100 ensures capacity maintained
  • maximumPercent: 200 allows full new set before terminating old
  • Circuit breaker prevents bad deployments from completing

⚠️ 7. Not Using Capacity Providers

Problem: Manual cluster scaling, inefficient resource usage.

❌ Wrong:

## Manually scaling Auto Scaling Group
autoscaling.set_desired_capacity(
    AutoScalingGroupName='ecs-cluster-asg',
    DesiredCapacity=10  # Guessing capacity needs
)

βœ… Right:

## Create capacity provider
ecs.create_capacity_provider(
    name='my-capacity-provider',
    autoScalingGroupProvider={
        'autoScalingGroupArn': 'arn:aws:autoscaling:...',
        'managedScaling': {
            'status': 'ENABLED',
            'targetCapacity': 80,  # Keep cluster 80% utilized
            'minimumScalingStepSize': 1,
            'maximumScalingStepSize': 10
        },
        'managedTerminationProtection': 'ENABLED'
    }
)

## Associate with cluster
ecs.put_cluster_capacity_providers(
    cluster='production',
    capacityProviders=['my-capacity-provider'],
    defaultCapacityProviderStrategy=[{
        'capacityProvider': 'my-capacity-provider',
        'weight': 1,
        'base': 2  # Always keep 2 instances minimum
    }]
)

Capacity providers automatically scale cluster to meet task demands.

⚠️ 8. Incorrect Service Discovery Configuration

Problem: Services can't find each other, hardcoded IPs break on updates.

❌ Wrong:

## Hardcoding service endpoints
os.environ['API_URL'] = 'http://10.0.1.45:8080'  # IP changes on deployment!

βœ… Right:

## Use Cloud Map service discovery
servicediscovery.create_service(
    Name='api-service',
    NamespaceId='ns-xxx',
    DnsConfig={
        'DnsRecords': [{'Type': 'A', 'TTL': 60}],
        'RoutingPolicy': 'MULTIVALUE'
    },
    HealthCheckCustomConfig={'FailureThreshold': 1}
)

## Configure ECS service
ecs.create_service(
    serviceName='api',
    serviceRegistries=[{
        'registryArn': 'arn:aws:servicediscovery:...:service/srv-xxx'
    }]
)

## Now reference by name
os.environ['API_URL'] = 'http://api-service.internal.example.com'

Key Takeaways

🎯 Container Orchestration Essentials:

  1. Choose the right service:

    • ECS for AWS-native simplicity
    • EKS for Kubernetes portability
    • Fargate for serverless operation
  2. Task definitions are blueprints - They define:

    • Container images and versions
    • Resource allocations (CPU/memory)
    • Networking configuration
    • IAM roles and permissions
    • Environment variables and secrets
  3. Services maintain desired state - They:

    • Keep specified number of tasks running
    • Integrate with load balancers
    • Handle rolling deployments
    • Auto-replace failed tasks
  4. Networking matters - Use awsvpc mode for:

    • Task-level security groups
    • Enhanced monitoring
    • Fargate compatibility
    • Production workloads
  5. Implement health checks everywhere:

    • Container health checks (task-level)
    • Target group health checks (ALB-level)
    • Application health endpoints
  6. Auto-scale intelligently:

    • Target tracking for CPU/memory
    • Request-count based for user traffic
    • Scheduled scaling for predictable patterns
    • Capacity providers for cluster scaling
  7. Security best practices:

    • Store secrets in Secrets Manager/Parameter Store
    • Use task-specific IAM roles
    • Run containers as non-root users
    • Scan images for vulnerabilities
  8. Deployment strategies reduce risk:

    • Rolling updates for gradual migration
    • Blue/green for instant rollback
    • Circuit breakers for auto-rollback
    • Canary deployments for testing
  9. Observability is critical:

    • CloudWatch Logs for application logs
    • CloudWatch Container Insights for metrics
    • AWS X-Ray for distributed tracing
    • ECS Exec for debugging
  10. Cost optimization techniques:

    • Use Fargate Spot for fault-tolerant workloads
    • Right-size task resources
    • Implement auto-scaling
    • Use Savings Plans for predictable workloads

πŸ“‹ Quick Reference Card

  • ECS Cluster - Logical grouping of container instances
  • Task Definition - JSON blueprint for containers
  • Task - Running instance of a task definition
  • Service - Maintains desired task count + load balancing
  • Fargate - Serverless compute for containers
  • awsvpc - Network mode giving each task its own ENI
  • Cloud Map - Service discovery via DNS
  • Capacity Provider - Auto-scales cluster infrastructure
  • Target Tracking - Maintains metric at target value
  • Blue/Green - Deployment with instant rollback capability


πŸŽ“ Continue your AWS journey by exploring service mesh architectures with AWS App Mesh, GitOps workflows with AWS CodePipeline, and cost optimization strategies for containerized workloads!