Container Orchestration
ECS vs EKS vs Fargate comparison, when to use each, and container deployment strategies
Container Orchestration on AWS
Master container orchestration with free flashcards and spaced repetition practice. This lesson covers Amazon ECS, EKS, Fargate, service discovery, and scaling strategies: essential concepts for building production-ready containerized applications on AWS.
Welcome to Container Orchestration
Container orchestration transforms how we deploy and manage applications at scale. While Docker packages your application into containers, orchestration platforms like Amazon ECS (Elastic Container Service) and Amazon EKS (Elastic Kubernetes Service) handle the complex tasks of scheduling, scaling, health monitoring, and service discovery across fleets of containers.
Think of containers as shipping containers and orchestration as the port management system. Just as ports coordinate where containers go, how they're loaded onto ships, and when they arrive, container orchestration platforms coordinate where your application containers run, how they scale, and how they communicate.
Why Container Orchestration Matters:
- Automated deployment - Deploy hundreds of containers with a single command
- Self-healing - Automatically replace failed containers
- Dynamic scaling - Add/remove containers based on demand
- Service discovery - Containers find each other automatically
- Rolling updates - Update applications with zero downtime
- Resource optimization - Pack containers efficiently across hosts
Core Concepts
Container Orchestration Fundamentals
Container orchestration is the automated management of containerized application lifecycles. On AWS, you have three primary options:
| Service | Best For | Control Level | Learning Curve |
|---|---|---|---|
| Amazon ECS | AWS-native apps | High | Low |
| Amazon EKS | Kubernetes workloads | Very High | High |
| AWS Fargate | Serverless containers | Low | Very Low |
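The decision logic behind this table can be sketched as a tiny helper. This is illustrative only; the function name and rules are assumptions made for this lesson, not an official AWS tool:

```python
def recommend_orchestrator(needs_kubernetes: bool, manage_servers: bool) -> str:
    """Hypothetical decision helper mirroring the comparison table above.

    Pick EKS when you need Kubernetes compatibility, ECS on EC2 when you
    want control over hosts, and Fargate when you want serverless operation.
    """
    if needs_kubernetes:
        return "EKS"
    return "ECS on EC2" if manage_servers else "ECS on Fargate"

print(recommend_orchestrator(needs_kubernetes=False, manage_servers=False))
# ECS on Fargate
```

Note that Fargate is a launch type rather than a separate orchestrator: you still use it through ECS (or EKS), which is why the "serverless" answer above is phrased as "ECS on Fargate".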
Amazon ECS Architecture
Amazon ECS is AWS's proprietary container orchestration service. It uses familiar AWS concepts and integrates seamlessly with other AWS services.
ECS ARCHITECTURE

+------------------+          +------------------+
|     CLUSTER      |<-------->|     SERVICE      |
| (Logical Group)  |          | (Desired State)  |
+------------------+          +---------+--------+
         |                              |
         v                              v
+------------------+          +------------------+
|    CONTAINER     |<-------->|       TASK       |
|    INSTANCES     |          |   (Container     |
|  (EC2/Fargate)   |          |   Definition)    |
+------------------+          +------------------+
         |
         v
+------------------+
|    CONTAINERS    |
+------------------+
Key ECS Components:
- Cluster - Logical grouping of container instances (EC2 or Fargate)
- Task Definition - Blueprint describing your containers (like a Dockerfile for orchestration)
- Task - Running instance of a task definition
- Service - Maintains desired number of tasks, handles load balancing
- Container Instance - EC2 instance running the ECS agent (not needed with Fargate)
💡 Memory Aid - CTTSC: Cluster holds Task definitions that run Tasks via Services on Container instances.
Task Definitions
A task definition is a JSON blueprint that describes:
- Which container images to use
- CPU and memory requirements
- Networking mode
- IAM roles
- Environment variables
- Volume mounts
Here's a basic task definition structure:
{
"family": "web-app",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "256",
"memory": "512",
"containerDefinitions": [
{
"name": "web-container",
"image": "nginx:latest",
"portMappings": [
{
"containerPort": 80,
"protocol": "tcp"
}
],
"essential": true,
"environment": [
{
"name": "ENV",
"value": "production"
}
]
}
]
}
Important fields:
- family - Groups related task definition versions
- networkMode - How containers communicate (awsvpc gives each task its own ENI)
- requiresCompatibilities - EC2, Fargate, or both
- cpu / memory - Resource allocations (in Fargate units)
- essential - If true, the task stops when this container stops
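Fargate only accepts certain CPU/memory pairings: cpu 256 with memory 512 registers fine, but a mismatched pair is rejected. A quick validator sketch follows. The tier table reflects commonly documented sizes up to 4 vCPU and is truncated (larger sizes exist), so verify against the current Fargate documentation before relying on it:

```python
# Fargate CPU (units) -> allowed memory values (MiB); illustrative/truncated.
FARGATE_COMBOS = {
    256:  [512, 1024, 2048],
    512:  [1024 * i for i in range(1, 5)],    # 1-4 GB
    1024: [1024 * i for i in range(2, 9)],    # 2-8 GB
    2048: [1024 * i for i in range(4, 17)],   # 4-16 GB
    4096: [1024 * i for i in range(8, 31)],   # 8-30 GB
}

def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """Return True if this cpu/memory pair is in the table above."""
    return memory in FARGATE_COMBOS.get(cpu, [])

print(is_valid_fargate_size(256, 512))    # True
print(is_valid_fargate_size(256, 4096))   # False: too much memory for 256 CPU
```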
Amazon EKS (Elastic Kubernetes Service)
Amazon EKS runs upstream Kubernetes on AWS, giving you full Kubernetes compatibility. It manages the control plane (master nodes) while you manage worker nodes.
EKS ARCHITECTURE

+----------------------------------+
|   CONTROL PLANE (AWS Managed)    |
|   API Server | etcd              |
|   Scheduler  | Controllers       |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
|    WORKER NODES (Your VPC)       |
|   [ Pod ]   [ Pod ]   [ Pod ]    |
|   Node 1    Node 2    Node 3     |
+----------------------------------+
Kubernetes vs ECS Terminology:
| ECS Term | Kubernetes Term | Description |
|---|---|---|
| Task | Pod | Group of containers running together |
| Service | Deployment + Service | Manages replicas and exposes them |
| Task Definition | Pod Spec | Container configuration blueprint |
| Cluster | Cluster | Group of compute resources |
AWS Fargate: Serverless Containers
AWS Fargate is a serverless compute engine for containers. You don't manage EC2 instancesβjust define your containers and Fargate handles the infrastructure.
🎯 Key Benefits:
- No server management - AWS provisions and scales compute
- Pay per use - Charged only for vCPU and memory consumed
- Better security - Task-level isolation, each task has its own kernel
- Right-sizing - Precise resource allocation per task
Fargate Launch Type vs EC2 Launch Type:
| Aspect | Fargate | EC2 |
|---|---|---|
| Server Management | ✅ Fully managed | ❌ You manage EC2 instances |
| Scaling | ✅ Instant, per-task | ⚠️ Must scale instances first |
| Pricing | Per vCPU-second + memory | EC2 instance pricing |
| Use Case | Variable workloads, quick starts | Consistent workloads, cost optimization |
| Control | Limited (no host access) | Full (SSH, custom AMIs) |
Service Discovery
Service discovery enables containers to find and communicate with each other automatically. AWS provides two mechanisms:
1. AWS Cloud Map
Cloud Map creates a service registry integrated with Route 53:
import boto3
servicediscovery = boto3.client('servicediscovery')
## Create namespace
namespace = servicediscovery.create_private_dns_namespace(
Name='internal.example.com',
Vpc='vpc-12345678'
)
## Create service
service = servicediscovery.create_service(
Name='web-api',
DnsConfig={
'DnsRecords': [{'Type': 'A', 'TTL': 60}]
},
HealthCheckCustomConfig={'FailureThreshold': 1}
)
Now containers can call web-api.internal.example.com and Cloud Map routes to healthy instances.
2. Application Load Balancer (ALB) with Target Groups
ALB can route to ECS tasks dynamically:
{
"loadBalancers": [
{
"targetGroupArn": "arn:aws:elasticloadbalancing:...",
"containerName": "web-container",
"containerPort": 80
}
],
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-12345", "subnet-67890"],
"securityGroups": ["sg-12345"],
"assignPublicIp": "ENABLED"
}
}
}
ECS automatically registers/deregisters tasks from the target group.
Auto Scaling Strategies
Container orchestration enables sophisticated scaling patterns:
1. Service Auto Scaling (Task Level)
Target Tracking Scaling - Maintain a metric at target value:
{
"ServiceName": "web-service",
"ScalableDimension": "ecs:service:DesiredCount",
"PolicyName": "cpu-target-tracking",
"PolicyType": "TargetTrackingScaling",
"TargetTrackingScalingPolicyConfiguration": {
"TargetValue": 75.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleOutCooldown": 60,
"ScaleInCooldown": 300
}
}
This maintains average CPU at 75% by adding/removing tasks.
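Conceptually, target tracking sizes the service in proportion to how far the metric sits from the target. AWS implements this internally via CloudWatch alarms and does not publish the exact algorithm, so the sketch below borrows the proportional rule that Kubernetes documents for its Horizontal Pod Autoscaler; it gives the right intuition but is not the literal ECS implementation:

```python
import math

def desired_tasks(current: int, metric: float, target: float) -> int:
    """Proportional scaling rule (per the Kubernetes HPA docs):
    desired = ceil(current * metric / target).
    Used here only as an approximation of ECS target tracking."""
    return math.ceil(current * metric / target)

print(desired_tasks(4, 90.0, 75.0))  # CPU above target -> scale out to 5
print(desired_tasks(4, 40.0, 75.0))  # CPU below target -> scale in to 3
```

Cooldowns then gate how quickly consecutive adjustments may fire, which is why the scale-in cooldown above is deliberately longer than the scale-out cooldown.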
2. Cluster Auto Scaling (Infrastructure Level)
Capacity Providers manage cluster capacity automatically:
ecs.put_cluster_capacity_providers(
cluster='production',
capacityProviders=['FARGATE', 'my-capacity-provider'],
defaultCapacityProviderStrategy=[
{
'capacityProvider': 'FARGATE',
'weight': 1,
'base': 2
}
]
)
Capacity provider strategy:
- base - Minimum number of tasks placed on this provider
- weight - Relative distribution (e.g., a 3:1 ratio between providers)
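To see how base and weight interact, here is a simplified distribution calculation. Real ECS placement also applies rounding rules and placement constraints, so treat this as an approximation; the FARGATE_SPOT strategy below is a hypothetical example:

```python
def split_tasks(total: int, providers: list[dict]) -> dict:
    """Approximate how a capacity provider strategy distributes tasks:
    'base' tasks go to their provider first, the remainder splits by weight."""
    counts = {p['name']: 0 for p in providers}
    remaining = total
    # Satisfy each provider's base first
    for p in providers:
        b = min(p.get('base', 0), remaining)
        counts[p['name']] += b
        remaining -= b
    # Split the rest proportionally by weight (floor division; any
    # remainder from rounding is ignored in this sketch)
    total_weight = sum(p['weight'] for p in providers)
    for p in providers:
        counts[p['name']] += remaining * p['weight'] // total_weight
    return counts

strategy = [
    {'name': 'FARGATE', 'weight': 1, 'base': 2},
    {'name': 'FARGATE_SPOT', 'weight': 3, 'base': 0},
]
print(split_tasks(10, strategy))  # {'FARGATE': 4, 'FARGATE_SPOT': 6}
```

With base=2 and a 1:3 weight ratio, FARGATE always runs the first 2 tasks, then the remaining 8 split 2/6, which is a common pattern for keeping a guaranteed on-demand floor under a mostly-Spot service.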
3. Scheduled Scaling
Scale based on predictable patterns:
scaling = boto3.client('application-autoscaling')
## Scale up for business hours
scaling.put_scheduled_action(
ServiceNamespace='ecs',
ScheduledActionName='scale-up-morning',
ResourceId='service/production/web-app',
ScalableDimension='ecs:service:DesiredCount',
Schedule='cron(0 8 * * ? *)',
ScalableTargetAction={'MinCapacity': 10, 'MaxCapacity': 50}
)
## Scale down after hours
scaling.put_scheduled_action(
ServiceNamespace='ecs',
ScheduledActionName='scale-down-evening',
ResourceId='service/production/web-app',
ScalableDimension='ecs:service:DesiredCount',
Schedule='cron(0 20 * * ? *)',
ScalableTargetAction={'MinCapacity': 2, 'MaxCapacity': 10}
)
SCALING DECISION FLOW

Metric Threshold Reached
          |
          v
Cooldown Period Over?
     |          |
    YES         NO --> Wait
     |
     v
Scale Action Triggered
     |
     v
Provision/Terminate Tasks
     |
     v
Update Desired Count
     |
     v
Start Cooldown Timer
💡 Scaling Best Practice: Set the scale-out cooldown low (60s) but the scale-in cooldown high (300s+). This allows quick response to load increases but prevents flapping during decreases.
Task Networking Modes
ECS supports multiple networking modes:
| Mode | How It Works | Use Case | Fargate Support |
|---|---|---|---|
| awsvpc | Each task gets its own ENI with private IP | Microservices, security groups per task | ✅ Required |
| bridge | Docker bridge on host, port mapping | Simple apps, port conflicts OK | ❌ No |
| host | Direct host network, no isolation | Maximum performance, no port conflicts | ❌ No |
| none | No external networking | Batch jobs, local processing | ❌ No |
awsvpc mode is recommended for production:
{
"networkMode": "awsvpc",
"networkConfiguration": {
"awsvpcConfiguration": {
"subnets": ["subnet-abc123", "subnet-def456"],
"securityGroups": ["sg-web-tier"],
"assignPublicIp": "DISABLED"
}
}
}
Benefits:
- Task-level security groups
- VPC Flow Logs per task
- Direct integration with VPC routing
- Required for Fargate
⚠️ ENI Limit Warning: Each awsvpc task consumes one ENI. Check EC2 instance ENI limits when sizing.
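A quick way to reason about this warning: subtract the instance's primary ENI from its ENI limit to get the awsvpc task capacity. The numbers below are placeholders; actual ENI limits, and the much higher per-instance task counts available with ENI trunking, vary by instance type, so check the AWS documentation for yours:

```python
def max_awsvpc_tasks(eni_limit: int, trunking: bool = False,
                     trunk_capacity: int = 10) -> int:
    """Illustrative awsvpc capacity check for one EC2 container instance.

    One ENI is reserved for the instance itself; each awsvpc task consumes
    one more. With ENI trunking enabled, a trunk ENI multiplexes many task
    ENIs (trunk_capacity is a placeholder, not a real per-type limit).
    """
    if trunking:
        return trunk_capacity
    return max(eni_limit - 1, 0)

print(max_awsvpc_tasks(3))  # an instance allowing 3 ENIs fits only 2 tasks
```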
Detailed Examples
Example 1: Deploying a Microservice with ECS and Fargate
Let's deploy a Node.js API with automatic scaling:
Step 1: Create Task Definition
{
"family": "api-service",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789:role/apiTaskRole",
"containerDefinitions": [
{
"name": "api",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest",
"cpu": 512,
"memory": 1024,
"essential": true,
"portMappings": [
{
"containerPort": 3000,
"protocol": "tcp"
}
],
"environment": [
{"name": "NODE_ENV", "value": "production"},
{"name": "PORT", "value": "3000"}
],
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/api-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "api"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
Key elements explained:
- executionRoleArn - IAM role for the ECS agent (pull images, write logs)
- taskRoleArn - IAM role for your application code (access AWS services)
- secrets - Inject from Secrets Manager (never hardcode passwords)
- healthCheck - ECS replaces unhealthy containers automatically
- logConfiguration - Send logs to CloudWatch Logs
Step 2: Create ECS Service
import boto3
ecs = boto3.client('ecs')
response = ecs.create_service(
cluster='production',
serviceName='api-service',
taskDefinition='api-service:3',
desiredCount=3,
launchType='FARGATE',
networkConfiguration={
'awsvpcConfiguration': {
'subnets': ['subnet-private1', 'subnet-private2'],
'securityGroups': ['sg-api-tier'],
'assignPublicIp': 'DISABLED'
}
},
loadBalancers=[
{
'targetGroupArn': 'arn:aws:elasticloadbalancing:...',
'containerName': 'api',
'containerPort': 3000
}
],
deploymentConfiguration={
'minimumHealthyPercent': 100,
'maximumPercent': 200,
'deploymentCircuitBreaker': {
'enable': True,
'rollback': True
}
},
enableExecuteCommand=True # Enable ECS Exec for debugging
)
Deployment configuration:
- minimumHealthyPercent: 100 - Always keep all tasks healthy during deployment
- maximumPercent: 200 - Can temporarily run 2x tasks (rolling deployment)
- deploymentCircuitBreaker - Auto-rollback if deployment fails
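These two percentages define a task-count envelope that ECS stays within during a rolling deployment. Per the documented rounding rules (the lower bound rounds up, the upper bound rounds down), the envelope can be computed like this:

```python
import math

def deployment_bounds(desired: int, min_healthy_pct: int, max_pct: int):
    """Task-count envelope during an ECS rolling update.

    Lower bound rounds up, upper bound rounds down, per the ECS
    deploymentConfiguration documentation."""
    lower = math.ceil(desired * min_healthy_pct / 100)
    upper = math.floor(desired * max_pct / 100)
    return lower, upper

print(deployment_bounds(3, 100, 200))  # (3, 6): start a full new set first
print(deployment_bounds(3, 50, 150))   # (2, 4): replace roughly one at a time
```

With desiredCount=3 and 100/200, ECS can launch all three new tasks before stopping any old ones, which is what makes the deployment zero-downtime.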
Step 3: Configure Auto Scaling
scaling = boto3.client('application-autoscaling')
## Register scalable target
scaling.register_scalable_target(
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
MinCapacity=3,
MaxCapacity=20
)
## CPU-based scaling
scaling.put_scaling_policy(
PolicyName='cpu-scaling',
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 70.0,
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'ECSServiceAverageCPUUtilization'
},
'ScaleOutCooldown': 60,
'ScaleInCooldown': 300
}
)
## Memory-based scaling
scaling.put_scaling_policy(
PolicyName='memory-scaling',
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 80.0,
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'ECSServiceAverageMemoryUtilization'
},
'ScaleOutCooldown': 60,
'ScaleInCooldown': 300
}
)
## Request-count-based scaling (ALB metric)
scaling.put_scaling_policy(
PolicyName='request-count-scaling',
ServiceNamespace='ecs',
ResourceId='service/production/api-service',
ScalableDimension='ecs:service:DesiredCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 1000.0,
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'ALBRequestCountPerTarget',
'ResourceLabel': 'app/my-alb/xxx/targetgroup/my-tg/yyy'
},
'ScaleOutCooldown': 60,
'ScaleInCooldown': 300
}
)
This creates three scaling policies that work together. AWS uses the one requiring the most capacity at any moment.
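The "most capacity wins" rule can be expressed directly: given each policy's recommended task count, the service follows the maximum, so the hottest metric is never starved. The recommendation numbers below are made up for illustration:

```python
def effective_desired_count(recommendations: dict) -> int:
    """When multiple target-tracking policies are attached, Application
    Auto Scaling acts on whichever policy asks for the most capacity."""
    return max(recommendations.values())

recs = {
    'cpu-scaling': 6,            # CPU alone would want 6 tasks
    'memory-scaling': 4,         # memory alone would want 4
    'request-count-scaling': 9,  # request volume wants 9
}
print(effective_desired_count(recs))  # 9
```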
Example 2: Blue/Green Deployment with CodeDeploy
Blue/green deployments minimize downtime and enable instant rollback:
## appspec.yaml for CodeDeploy
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:us-east-1:123456789:task-definition/api-service:4"
        LoadBalancerInfo:
          ContainerName: "api"
          ContainerPort: 3000
        PlatformVersion: "LATEST"
Hooks:
  - BeforeInstall: "LambdaFunctionToValidateBeforeInstall"
  - AfterInstall: "LambdaFunctionToValidateAfterInstall"
  - AfterAllowTestTraffic: "LambdaFunctionToTestNewVersion"
  - BeforeAllowTraffic: "LambdaFunctionToValidateBeforeTrafficShift"
  - AfterAllowTraffic: "LambdaFunctionToValidateAfterTrafficShift"
Deployment flow:
BLUE/GREEN DEPLOYMENT FLOW

Step 1: BLUE (Current)
  ALB -> Target Group 1 -> Blue Tasks
  100% traffic

Step 2: GREEN (New) Provisioned
  ALB -> Target Group 1 -> Blue Tasks
         Target Group 2 -> Green Tasks
  100% -> Blue, 0% -> Green

Step 3: Test Traffic (Optional)
  Test listener routes to Green
  Run validation tests

Step 4: Traffic Shift
  ALB switches to Target Group 2
  0% -> Blue, 100% -> Green

Step 5: Cleanup (After wait)
  Terminate Blue tasks
  Green becomes the new Blue
Configure deployment in CodeDeploy:
codedeploy = boto3.client('codedeploy')
response = codedeploy.create_deployment_group(
applicationName='api-app',
deploymentGroupName='api-prod-dg',
deploymentConfigName='CodeDeployDefault.ECSCanary10Percent5Minutes',
serviceRoleArn='arn:aws:iam::123456789:role/CodeDeployRole',
ecsServices=[
{
'serviceName': 'api-service',
'clusterName': 'production'
}
],
loadBalancerInfo={
'targetGroupPairInfoList': [
{
'targetGroups': [
{'name': 'api-blue-tg'},
{'name': 'api-green-tg'}
],
'prodTrafficRoute': {
'listenerArns': ['arn:aws:elasticloadbalancing:...']
},
'testTrafficRoute': {
'listenerArns': ['arn:aws:elasticloadbalancing:...:8080']
}
}
]
},
blueGreenDeploymentConfiguration={
'terminateBlueInstancesOnDeploymentSuccess': {
'action': 'TERMINATE',
'terminationWaitTimeInMinutes': 5
},
'deploymentReadyOption': {
'actionOnTimeout': 'CONTINUE_DEPLOYMENT'
}
}
)
Deployment configurations:
- CodeDeployDefault.ECSLinear10PercentEvery1Minutes - Shifts 10% of traffic every minute
- CodeDeployDefault.ECSCanary10Percent5Minutes - Shifts 10% first, waits 5 minutes, then shifts the remaining 90%
- CodeDeployDefault.ECSAllAtOnce - Instant cutover
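These configurations trade speed for safety in how traffic reaches the green task set. A sketch of their timelines as (minute, percent-to-green) pairs follows; the shapes come from the config names, while exact timing, hooks, and rollback behavior remain controlled by CodeDeploy:

```python
def traffic_schedule(config: str) -> list[tuple[int, int]]:
    """Approximate traffic-shift timeline for the three CodeDeploy ECS
    deployment configurations listed above."""
    if config == 'ECSCanary10Percent5Minutes':
        return [(0, 10), (5, 100)]            # canary, then the rest
    if config == 'ECSLinear10PercentEvery1Minutes':
        return [(m, 10 * (m + 1)) for m in range(10)]  # 10% per minute
    if config == 'ECSAllAtOnce':
        return [(0, 100)]                     # instant cutover
    raise ValueError(f"unknown config: {config}")

print(traffic_schedule('ECSCanary10Percent5Minutes'))  # [(0, 10), (5, 100)]
```

The canary shape is usually the best default: 10% of real traffic exercises the new version while 90% of users stay on the known-good one, and a failed hook during the wait rolls everything back.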
Example 3: Multi-Container Task with Sidecar Pattern
The sidecar pattern places helper containers alongside your main application:
{
"family": "web-with-logging",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "web-app",
"image": "my-web-app:latest",
"cpu": 768,
"memory": 1536,
"essential": true,
"portMappings": [{"containerPort": 80}],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/web-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "web"
}
},
"dependsOn": [
{
"containerName": "log-router",
"condition": "START"
}
]
},
{
"name": "log-router",
"image": "fluent/fluentd:latest",
"cpu": 256,
"memory": 512,
"essential": false,
"environment": [
{"name": "FLUENTD_CONF", "value": "fluentd.conf"}
],
"mountPoints": [
{
"sourceVolume": "logs",
"containerPath": "/var/log/app"
}
]
}
],
"volumes": [
{
"name": "logs",
"host": {}
}
]
}
Key points:
- Main container marked essential: true - the task stops if it fails
- Sidecar marked essential: false - the task continues if it fails
- dependsOn ensures log-router starts before web-app
- Shared volume enables inter-container communication
Common sidecar use cases:
- Log aggregation (Fluentd, Fluent Bit)
- Service mesh proxy (Envoy, Linkerd)
- Secret management (Vault agent)
- Monitoring agents (Datadog, New Relic)
Example 4: EKS Deployment with kubectl
Deploy a microservice to EKS using Kubernetes manifests:
deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
  namespace: production
  labels:
    app: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: api
        version: v1
    spec:
      serviceAccountName: api-service-account
      containers:
        - name: api
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          env:
            - name: ENV
              value: production
            - name: DB_HOST
              valueFrom:
                configMapKeyRef:
                  name: api-config
                  key: database.host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: db.password
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  sessionAffinity: None
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
        - type: Pods
          value: 4
          periodSeconds: 30
      selectPolicy: Max
Deploy with:
## Configure kubectl for EKS
aws eks update-kubeconfig --name production-cluster --region us-east-1
## Apply manifests
kubectl apply -f deployment.yaml
## Verify deployment
kubectl get deployments -n production
kubectl get pods -n production
kubectl get hpa -n production
## View logs
kubectl logs -f deployment/api-deployment -n production
## Check scaling events
kubectl describe hpa api-hpa -n production
Kubernetes advantages:
- Declarative configuration (desired state)
- Built-in service discovery (CoreDNS)
- Advanced scheduling (node affinity, taints/tolerations)
- Ecosystem tools (Helm, Istio, Prometheus)
Common Mistakes
⚠️ 1. Insufficient Resource Allocation
Problem: Tasks crash with "OutOfMemory" errors or get throttled.
❌ Wrong:
{
"cpu": "256",
"memory": "512",
"containerDefinitions": [{
"name": "app",
"memory": 512 // No headroom for JVM, buffers, etc.
}]
}
✅ Right:
{
"cpu": "512",
"memory": "1024",
"containerDefinitions": [{
"name": "app",
"memory": 768, // 75% of task memory
"memoryReservation": 512 // Soft limit
}]
}
💡 Tip: Monitor the MemoryUtilization CloudWatch metric. Add a 25-50% buffer over observed peak usage.
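That buffer rule can be turned into a small sizing helper that rounds the padded peak up to a task-size tier. The tier list below is illustrative and deliberately truncated (Fargate supports more sizes than shown):

```python
import math

# Illustrative/truncated list of Fargate task memory sizes, in MiB
FARGATE_MEMORY_TIERS = [512, 1024, 2048, 3072, 4096, 6144, 8192]

def recommend_memory(peak_mib: int, buffer: float = 0.5) -> int:
    """Pad observed peak memory by the given buffer (50% here, matching
    the upper end of the 25-50% rule), then round up to the next tier."""
    needed = math.ceil(peak_mib * (1 + buffer))
    for tier in FARGATE_MEMORY_TIERS:
        if tier >= needed:
            return tier
    raise ValueError("peak exceeds the tiers listed in this sketch")

print(recommend_memory(600))  # 600 * 1.5 = 900 -> next tier is 1024
```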
⚠️ 2. Missing Health Checks
Problem: ECS routes traffic to unhealthy containers, causing 5xx errors.
❌ Wrong:
{
"containerDefinitions": [{
"name": "app",
"portMappings": [{"containerPort": 80}]
// No healthCheck defined!
}]
}
✅ Right:
{
"containerDefinitions": [{
"name": "app",
"portMappings": [{"containerPort": 80}],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60 // Grace period for slow startup
}
}]
}
Plus configure ALB target group health check:
elbv2.modify_target_group(
TargetGroupArn='arn:...',
HealthCheckEnabled=True,
HealthCheckPath='/health',
HealthCheckIntervalSeconds=30,
HealthyThresholdCount=2,
UnhealthyThresholdCount=3
)
⚠️ 3. Not Using awsvpc Network Mode
Problem: Can't use security groups per task, limited networking features.
❌ Wrong:
{
"networkMode": "bridge", // Old default
"containerDefinitions": [{
"portMappings": [{"hostPort": 0, "containerPort": 80}] // Dynamic ports
}]
}
✅ Right:
{
"networkMode": "awsvpc",
"containerDefinitions": [{
"portMappings": [{"containerPort": 80}] // No hostPort needed
}]
}
Benefits of awsvpc:
- Task-specific security groups
- VPC Flow Logs per task
- Direct ENI attachment
- Required for Fargate
⚠️ 4. Hardcoded Secrets in Task Definitions
Problem: Secrets exposed in console, CloudTrail, version control.
❌ Wrong:
{
"environment": [
{"name": "DB_PASSWORD", "value": "MySuperSecretPassword123"} // NEVER!
]
}
✅ Right:
{
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:db-password-AbCdEf"
}
],
"executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole" // Must have secretsmanager:GetSecretValue
}
Or use Systems Manager Parameter Store:
{
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/prod/db/password"
}
]
}
⚠️ 5. Inadequate Logging Configuration
Problem: Can't troubleshoot issues, no visibility into container behavior.
❌ Wrong:
{
"containerDefinitions": [{
"name": "app"
// No logConfiguration - logs go nowhere!
}]
}
✅ Right:
{
"containerDefinitions": [{
"name": "app",
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/my-app",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "app",
"awslogs-datetime-format": "%Y-%m-%d %H:%M:%S"
}
}
}]
}
Create log group first:
aws logs create-log-group --log-group-name /ecs/my-app
aws logs put-retention-policy --log-group-name /ecs/my-app --retention-in-days 7
⚠️ 6. Ignoring Deployment Configuration
Problem: All tasks replaced simultaneously, causing downtime.
❌ Wrong:
ecs.create_service(
desiredCount=10
# Using defaults: minimumHealthyPercent=100, maximumPercent=200
)
Relying on the defaults leaves you without a circuit breaker, so a failing deployment keeps retrying instead of rolling back automatically.
✅ Right:
ecs.create_service(
    desiredCount=10,
    deploymentConfiguration={
        'minimumHealthyPercent': 100,  # Never go below 10 tasks
        'maximumPercent': 150,  # Can temporarily run 15 tasks (50% overhead)
        'deploymentCircuitBreaker': {
            'enable': True,
            'rollback': True  # Auto-rollback on repeated failures
        }
    }
)
For zero-downtime deployments:
- minimumHealthyPercent: 100 ensures capacity is maintained
- maximumPercent: 200 allows a full new task set before terminating the old one
- The circuit breaker prevents bad deployments from completing
⚠️ 7. Not Using Capacity Providers
Problem: Manual cluster scaling, inefficient resource usage.
❌ Wrong:
## Manually scaling Auto Scaling Group
autoscaling.set_desired_capacity(
AutoScalingGroupName='ecs-cluster-asg',
DesiredCapacity=10 # Guessing capacity needs
)
✅ Right:
## Create capacity provider
ecs.create_capacity_provider(
name='my-capacity-provider',
autoScalingGroupProvider={
'autoScalingGroupArn': 'arn:aws:autoscaling:...',
'managedScaling': {
'status': 'ENABLED',
'targetCapacity': 80,  # Keep the cluster at 80% utilization
'minimumScalingStepSize': 1,
'maximumScalingStepSize': 10
},
'managedTerminationProtection': 'ENABLED'
}
)
## Associate with cluster
ecs.put_cluster_capacity_providers(
cluster='production',
capacityProviders=['my-capacity-provider'],
defaultCapacityProviderStrategy=[{
'capacityProvider': 'my-capacity-provider',
'weight': 1,
'base': 2  # Always run at least 2 tasks on this provider
}]
)
Capacity providers automatically scale cluster to meet task demands.
⚠️ 8. Incorrect Service Discovery Configuration
Problem: Services can't find each other, hardcoded IPs break on updates.
❌ Wrong:
## Hardcoding service endpoints
os.environ['API_URL'] = 'http://10.0.1.45:8080' # IP changes on deployment!
✅ Right:
## Use Cloud Map service discovery
servicediscovery.create_service(
Name='api-service',
NamespaceId='ns-xxx',
DnsConfig={
'DnsRecords': [{'Type': 'A', 'TTL': 60}],
'RoutingPolicy': 'MULTIVALUE'
},
HealthCheckCustomConfig={'FailureThreshold': 1}
)
## Configure ECS service
ecs.create_service(
serviceName='api',
serviceRegistries=[{
'registryArn': 'arn:aws:servicediscovery:...:service/srv-xxx'
}]
)
## Now reference by name
os.environ['API_URL'] = 'http://api-service.internal.example.com'
Key Takeaways
🎯 Container Orchestration Essentials:
Choose the right service:
- ECS for AWS-native simplicity
- EKS for Kubernetes portability
- Fargate for serverless operation
Task definitions are blueprints - They define:
- Container images and versions
- Resource allocations (CPU/memory)
- Networking configuration
- IAM roles and permissions
- Environment variables and secrets
Services maintain desired state - They:
- Keep specified number of tasks running
- Integrate with load balancers
- Handle rolling deployments
- Auto-replace failed tasks
Networking matters - Use awsvpc mode for:
- Task-level security groups
- Enhanced monitoring
- Fargate compatibility
- Production workloads
Implement health checks everywhere:
- Container health checks (task-level)
- Target group health checks (ALB-level)
- Application health endpoints
Auto-scale intelligently:
- Target tracking for CPU/memory
- Request-count based for user traffic
- Scheduled scaling for predictable patterns
- Capacity providers for cluster scaling
Security best practices:
- Store secrets in Secrets Manager/Parameter Store
- Use task-specific IAM roles
- Run containers as non-root users
- Scan images for vulnerabilities
Deployment strategies reduce risk:
- Rolling updates for gradual migration
- Blue/green for instant rollback
- Circuit breakers for auto-rollback
- Canary deployments for testing
Observability is critical:
- CloudWatch Logs for application logs
- CloudWatch Container Insights for metrics
- AWS X-Ray for distributed tracing
- ECS Exec for debugging
Cost optimization techniques:
- Use Fargate Spot for fault-tolerant workloads
- Right-size task resources
- Implement auto-scaling
- Use Savings Plans for predictable workloads
Quick Reference Card
| Term | Definition |
|---|---|
| ECS Cluster | Logical grouping of container instances |
| Task Definition | JSON blueprint for containers |
| Task | Running instance of task definition |
| Service | Maintains desired task count + load balancing |
| Fargate | Serverless compute for containers |
| awsvpc | Network mode giving each task its own ENI |
| Cloud Map | Service discovery via DNS |
| Capacity Provider | Auto-scales cluster infrastructure |
| Target Tracking | Maintains metric at target value |
| Blue/Green | Deployment with instant rollback capability |
Further Study
Official AWS Documentation:
- Amazon ECS Developer Guide - Comprehensive ECS documentation
- Amazon EKS User Guide - Complete EKS reference
- AWS Fargate User Guide - Fargate-specific configuration
Best Practices:
- ECS Best Practices Guide - Official best practices
- AWS Architecture Center - Containers - Reference architectures
Continue your AWS journey by exploring service mesh architectures with AWS App Mesh, GitOps workflows with AWS CodePipeline, and cost optimization strategies for containerized workloads!