Building Production-Grade ECS Microservices with CI/CD - Part 1: Architecture Overview

Comprehensive guide to building a production-ready microservices architecture on AWS ECS Fargate with full CI/CD automation, covering architecture design, security, scalability, and cost optimization.

Scalable, reliable, and maintainable applications need a solid architectural foundation. This series walks through building a production-grade microservices architecture on AWS ECS Fargate with complete CI/CD automation.

What We’re Building

This project demonstrates a production-ready microservices architecture that includes:

  • Multi-tier application with Flask API, Nginx reverse proxy, and Redis caching
  • Container orchestration using AWS ECS Fargate (serverless)
  • High availability across multiple availability zones
  • Auto-scaling based on CPU utilization
  • Complete CI/CD pipeline with GitHub Actions
  • Infrastructure as Code using Terraform
  • Security best practices and monitoring

Architecture Overview

Our architecture follows AWS Well-Architected Framework principles: compute runs in private subnets across two availability zones, services scale on CPU utilization, and the infrastructure is defined in Terraform.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Internet/Users                           │
└────────────────────────┬────────────────────────────────────────┘
                         │
                         ▼
         ┌───────────────────────────────┐
         │  Route53 DNS (Optional)       │
         │  - Custom domain              │
         │  - ACM SSL Certificate        │
         └───────────┬───────────────────┘
                     │
                     ▼
    ┌────────────────────────────────────────┐
    │   Application Load Balancer (ALB)      │
    │   - Public subnets (2 AZs)             │
    │   - HTTP/HTTPS listeners               │
    │   - Health checks                      │
    └─────────────┬──────────────────────────┘
                  │
    ┌─────────────┴─────────────┐
    │                           │
    ▼                           ▼
┌────────────────────┐  ┌────────────────────┐
│   AZ: ap-south-1a  │  │   AZ: ap-south-1b  │
│  Public Subnet     │  │  Public Subnet     │
│  10.0.1.0/24       │  │  10.0.2.0/24       │
└─────────┬──────────┘  └──────────┬─────────┘
          │                        │
    ┌─────┴────────────────────────┴─────┐
    │      NAT Gateways (Multi-AZ)       │
    └─────┬────────────────────────┬─────┘
          │                        │
          ▼                        ▼
┌────────────────────┐  ┌────────────────────┐
│  Private Subnet    │  │  Private Subnet    │
│  10.0.11.0/24      │  │  10.0.12.0/24      │
│                    │  │                    │
│  ┌──────────────┐ │  │ ┌──────────────┐   │
│  │ ECS Fargate  │ │  │ │ ECS Fargate  │   │
│  │   Tasks      │ │  │ │   Tasks      │   │
│  │              │ │  │ │              │   │
│  │ ┌──────────┐ │ │  │ │ ┌──────────┐ │   │
│  │ │  Nginx   │ │ │  │ │ │  Nginx   │ │   │
│  │ │  (x2)    │ │ │  │ │ │  (x2)    │ │   │
│  │ └────┬─────┘ │ │  │ │ └────┬─────┘ │   │
│  │      │       │ │  │ │      │       │   │
│  │ ┌────▼─────┐ │ │  │ │ ┌────▼─────┐ │   │
│  │ │ Flask    │ │ │  │ │ │ Flask    │ │   │
│  │ │ API (x2) │ │ │  │ │ │ API (x2) │ │   │
│  │ └────┬─────┘ │ │  │ │ └────┬─────┘ │   │
│  │      │       │ │  │ │      │       │   │
│  │ ┌────▼─────┐ │ │  │ │ ┌────▼─────┐ │   │
│  │ │ Redis    │ │ │  │ │ │ Redis    │ │   │
│  │ │ (x1)     │ │ │  │ │ │ (shared) │ │   │
│  │ └──────────┘ │ │  │ │ └──────────┘ │   │
│  └──────────────┘ │  │ └──────────────┘   │
└─────────┬──────────┘  └──────────┬─────────┘
          │                        │
          │   Service Discovery    │
          │   (Cloud Map DNS)      │
          └────────────┬───────────┘
                       │
                       ▼
           ┌───────────────────────┐
           │  Database Subnets     │
           │  10.0.21.0/24         │
           │  10.0.22.0/24         │
           │                       │
           │  ┌─────────────────┐  │
           │  │  RDS PostgreSQL │  │
           │  │  Multi-AZ       │  │
           │  │Primary + Standby│  │
           │  └─────────────────┘  │
           └───────────────────────┘

Core Components

1. Networking Layer

VPC (Virtual Private Cloud)

  • CIDR: 10.0.0.0/16
  • DNS Support: Enabled
  • DNS Hostnames: Enabled

Subnet Architecture

  • Public Subnets (2): Host ALB and NAT Gateways
    • CIDR: 10.0.1.0/24, 10.0.2.0/24
    • AZs: ap-south-1a, ap-south-1b
  • Private Subnets (2): Host ECS Fargate tasks
    • CIDR: 10.0.11.0/24, 10.0.12.0/24
    • AZs: ap-south-1a, ap-south-1b
  • Database Subnets (2): Host RDS PostgreSQL
    • CIDR: 10.0.21.0/24, 10.0.22.0/24
    • AZs: ap-south-1a, ap-south-1b
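
A quick way to sanity-check a subnet plan like this one is to confirm that every subnet sits inside the VPC range and that no two subnets overlap. The sketch below does that with Python's standard `ipaddress` module. The CIDR blocks come from the layout above; the dictionary keys (`public-1a` and so on) are illustrative labels, not resource names from the project.

```python
import ipaddress

# CIDR blocks copied from the subnet layout above.
# The keys are illustrative labels, not actual resource names.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-1a": "10.0.1.0/24",
    "public-1b": "10.0.2.0/24",
    "private-1a": "10.0.11.0/24",
    "private-1b": "10.0.12.0/24",
    "database-1a": "10.0.21.0/24",
    "database-1b": "10.0.22.0/24",
}

# AWS reserves five addresses in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def validate_subnets(vpc, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in networks.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC range {vpc}")
    names = sorted(networks)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if networks[first].overlaps(networks[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


def usable_hosts(cidr):
    """Addresses left for workloads after the AWS-reserved ones are removed."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(validate_subnets(VPC_CIDR, SUBNETS))
    print(usable_hosts("10.0.11.0/24"))
```

Each /24 leaves 251 usable addresses, which is comfortable headroom for services that scale to four tasks each.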

2. Compute Layer - ECS Fargate

ECS Cluster

  • Name: ecs-microservices-cluster
  • Type: Fargate (serverless)
  • Capacity Providers: FARGATE, FARGATE_SPOT
  • Container Insights: Enabled

Services Architecture

Nginx Service

  • Purpose: Reverse proxy and load distribution
  • Resources: 256 CPU (.25 vCPU), 512 MB Memory
  • Scaling: 2-4 tasks based on CPU
  • Health Check: /nginx-health endpoint

Flask API Service

  • Purpose: Backend application logic
  • Resources: 512 CPU (.5 vCPU), 1024 MB Memory
  • Scaling: 2-4 tasks based on CPU
  • Health Check: /health endpoint

Redis Service

  • Purpose: In-memory caching and session storage
  • Resources: 256 CPU (.25 vCPU), 512 MB Memory
  • Image: redis:7-alpine
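
Fargate only accepts certain CPU and memory pairings, so it helps to check the sizes above before they go into a task definition. The following is a minimal sketch of such a check. The lookup table covers only the three smallest CPU settings, and the function name is an assumption made for illustration.

```python
# Memory sizes (MiB) that Fargate accepts for each CPU setting.
# Only the CPU values relevant to this article are listed.
FARGATE_MEMORY_MIB = {
    256: {512, 1024, 2048},
    512: {1024, 2048, 3072, 4096},
    1024: {2048, 3072, 4096, 5120, 6144, 7168, 8192},
}

# (CPU units, memory MiB) per service, as listed above.
SERVICES = {
    "nginx": (256, 512),
    "flask-app": (512, 1024),
    "redis": (256, 512),
}


def is_valid_fargate_size(cpu_units, memory_mib):
    """True when the CPU/memory pair is one Fargate supports (per the table above)."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, set())


if __name__ == "__main__":
    for name, (cpu, memory) in SERVICES.items():
        print(name, cpu, memory, is_valid_fargate_size(cpu, memory))
```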

3. Service Communication

ECS Service Connect

  • Namespace: ecs-microservices.local
  • DNS Resolution: Automatic
  • Service Discovery: Cloud Map
  • Benefits:
    • No need for ALB between services
    • Automatic health checking
    • Connection pooling
    • Observability

Service Mesh Communication

Nginx → flask-app:5000 (via Service Connect DNS)
Flask → redis:6379 (via Service Connect DNS)
Flask → RDS (via security group)
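
Because Service Connect exposes each service under a short DNS name, application code only needs a host and port for each dependency. The sketch below shows one way a service might read those endpoints from its environment, falling back to the names used above. The environment variable names `FLASK_UPSTREAM` and `REDIS_ENDPOINT` are hypothetical and not taken from the project.

```python
import os

# Defaults match the Service Connect names shown above.
# The environment variable names are hypothetical.
DEFAULT_ENDPOINTS = {
    "FLASK_UPSTREAM": "flask-app:5000",
    "REDIS_ENDPOINT": "redis:6379",
}


def parse_endpoint(value):
    """Split 'host:port' into (host, port) and reject malformed values."""
    host, separator, port = value.rpartition(":")
    if not separator or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    return host, int(port)


def load_endpoints(environ=None):
    """Read endpoints from the environment, using the defaults when unset."""
    environ = os.environ if environ is None else environ
    return {
        name: parse_endpoint(environ.get(name, default))
        for name, default in DEFAULT_ENDPOINTS.items()
    }


if __name__ == "__main__":
    print(load_endpoints({}))
```

Keeping the endpoints overridable makes it possible to run the same container locally, where the Service Connect names do not resolve.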

4. Load Balancing

Application Load Balancer

  • Type: Application
  • Scheme: Internet-facing
  • Listeners: HTTP/HTTPS
  • Target: Nginx service (IP targets)
  • Health Checks:
    • Path: /nginx-health
    • Interval: 30s
    • Healthy threshold: 2
    • Unhealthy threshold: 3
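
The thresholds above decide how many consecutive check results it takes to flip a target's state. The sketch below models that counting logic in a simplified form, as an illustration rather than the ALB's actual implementation: it ignores the initial registration state and timeouts, and the class name is an assumption.

```python
from dataclasses import dataclass

INTERVAL_SECONDS = 30  # health check interval from the settings above


@dataclass
class TargetHealth:
    """Simplified model of threshold-based health tracking for one target."""
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3
    healthy: bool = True
    streak: int = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            self.streak = 0
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            self.healthy = check_passed
            self.streak = 0
        return self.healthy


if __name__ == "__main__":
    target = TargetHealth()
    for passed in (False, False, False, True, True):
        print(passed, target.record(passed))
```

Under this model a failing task is taken out of rotation after three failed checks, roughly 90 seconds at a 30-second interval, and returns after two passing checks.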

5. Database Layer

RDS PostgreSQL

  • Engine: PostgreSQL 15.4
  • Instance Class: db.t3.micro
  • Storage: 20 GB gp3 (encrypted)
  • Multi-AZ: Enabled (automatic failover)
  • Backup: 7 days retention
  • Connectivity: Private subnets only

6. Container Registry

Amazon ECR

  • Repositories:
    • ecs-microservices/flask-app
    • ecs-microservices/nginx
  • Image Scanning: Enabled on push
  • Encryption: AES256
  • Lifecycle Policy: Keep last 10 images
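
The "keep last 10 images" rule maps onto an ECR lifecycle policy document. The sketch below builds that JSON in Python so it can be inspected or passed to whatever tool applies it. The helper name and the rule description text are assumptions, and applying the policy to all tag statuses is one possible reading of the rule.

```python
import json


def keep_last_n_policy(image_count):
    """Build an ECR lifecycle policy that expires all but the newest images."""
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep last {image_count} images",
                "selection": {
                    "tagStatus": "any",  # assumption: tagged and untagged alike
                    "countType": "imageCountMoreThan",
                    "countNumber": image_count,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(keep_last_n_policy(10))
```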

Security Architecture

Security Groups

ALB Security Group

  • Inbound: Port 80 (HTTP), Port 443 (HTTPS) from 0.0.0.0/0
  • Outbound: All traffic

ECS Tasks Security Group

  • Inbound: All ports from ALB security group, self-referencing for service mesh
  • Outbound: All traffic

RDS Security Group

  • Inbound: Port 5432 from ECS tasks security group
  • Outbound: All traffic
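
The three groups form a chain: the internet reaches only the ALB, the ALB reaches only the ECS tasks, and only the tasks reach the database. The sketch below encodes the inbound rules as data and checks which paths they allow. It is a toy model for reasoning about the rules, not a representation of how AWS evaluates them; the group labels and function name are illustrative.

```python
from dataclasses import dataclass

INTERNET = "0.0.0.0/0"
ALL_PORTS = range(0, 65536)


@dataclass(frozen=True)
class InboundRule:
    source: str   # a CIDR block or the label of another security group
    ports: range  # ports the rule opens


# Inbound rules from the three security groups described above.
INBOUND = {
    "alb": [
        InboundRule(INTERNET, range(80, 81)),
        InboundRule(INTERNET, range(443, 444)),
    ],
    "ecs-tasks": [
        InboundRule("alb", ALL_PORTS),
        InboundRule("ecs-tasks", ALL_PORTS),  # self-reference for service-to-service traffic
    ],
    "rds": [InboundRule("ecs-tasks", range(5432, 5433))],
}


def is_allowed(source, destination, port):
    """True when some inbound rule on the destination admits this source and port."""
    return any(
        rule.source == source and port in rule.ports
        for rule in INBOUND.get(destination, [])
    )


if __name__ == "__main__":
    print(is_allowed(INTERNET, "alb", 443))
    print(is_allowed(INTERNET, "rds", 5432))
    print(is_allowed("ecs-tasks", "rds", 5432))
```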

IAM Roles

ECS Task Execution Role

  • Pull images from ECR
  • Write logs to CloudWatch
  • Get secrets from Secrets Manager

ECS Task Role

  • ECS Exec (debugging)
  • Application-specific AWS API calls
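
To make the task execution role concrete, the sketch below expresses its three permissions as IAM policy documents built in Python. It is one plausible shape for the policy, not the project's actual definition. `"Resource": "*"` is used only to keep the example short; a real policy would normally scope each statement to specific repositories, log groups, and secrets.

```python
import json

# Trust policy: lets ECS tasks assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions for the three duties listed above.
# "Resource": "*" is a simplification for this sketch.
EXECUTION_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PullImagesFromEcr",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
            ],
            "Resource": "*",
        },
        {
            "Sid": "WriteLogs",
            "Effect": "Allow",
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*",
        },
        {
            "Sid": "ReadSecrets",
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "*",
        },
    ],
}


def allowed_actions(policy):
    """Collect every action granted by the Allow statements of a policy."""
    actions = set()
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        granted = statement["Action"]
        actions.update([granted] if isinstance(granted, str) else granted)
    return actions


if __name__ == "__main__":
    print(json.dumps(EXECUTION_POLICY, indent=2))
    print(sorted(allowed_actions(EXECUTION_POLICY)))
```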

Monitoring & Observability

CloudWatch Integration

Log Groups

  • /ecs/ecs-microservices/flask-app
  • /ecs/ecs-microservices/nginx
  • /ecs/ecs-microservices/redis
  • /ecs/ecs-microservices/exec

Metrics

  • Container Insights enabled
  • Custom application metrics
  • ALB metrics (requests, latency, errors)
  • RDS metrics (CPU, connections, IOPS)

Auto Scaling

Configuration

  • Metric: CPU Utilization
  • Target: 70%
  • Scale Out: Add task when > 70% for 60s
  • Scale In: Remove task when < 70% for 300s
  • Limits: Min 2, Max 4 tasks
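
A rough way to reason about these settings is the proportional rule behind target tracking: size the service so that average CPU lands near the target, then clamp to the limits. The sketch below applies that rule. It is an approximation for intuition only; the real scaling behaviour also depends on CloudWatch alarm evaluation and the cooldown periods, and the function name is an assumption.

```python
import math


def desired_tasks(current_tasks, cpu_percent, target=70.0, minimum=2, maximum=4):
    """Estimate the task count that would bring average CPU back to the target."""
    proportional = math.ceil(current_tasks * cpu_percent / target)
    return max(minimum, min(maximum, proportional))


if __name__ == "__main__":
    for tasks, cpu in [(2, 90), (4, 95), (4, 20), (2, 70)]:
        print(tasks, cpu, desired_tasks(tasks, cpu))
```

With two tasks at 90% CPU the estimate is three tasks. At four tasks the service is already at its ceiling, so sustained load beyond that point calls for a higher maximum or larger tasks.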

CI/CD Pipeline

GitHub Actions Workflow

Automated Deployment Process

  1. Code Push → GitHub repository
  2. Build → Docker images for Flask and Nginx
  3. Push → Images to ECR with version tags
  4. Update → ECS task definitions
  5. Deploy → New versions to ECS services
  6. Verify → Deployment success
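
Steps 3 and 4 come down to two small operations: composing the ECR image URI for the new tag, and pointing the task definition's container at it. The sketch below shows both on plain dictionaries. The account ID, the tag value, and the helper names are hypothetical; the region `ap-south-1` is inferred from the availability zones used earlier.

```python
import copy


def ecr_image_uri(account_id, region, repository, tag):
    """Compose the registry URI for an image stored in ECR."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def with_new_image(task_definition, container_name, image_uri):
    """Return a copy of the task definition with one container's image replaced."""
    updated = copy.deepcopy(task_definition)
    for container in updated["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image_uri
            return updated
    raise KeyError(f"no container named {container_name!r}")


if __name__ == "__main__":
    # Hypothetical account ID and tag, for illustration only.
    uri = ecr_image_uri("123456789012", "ap-south-1", "ecs-microservices/flask-app", "abc1234")
    task_def = {"containerDefinitions": [{"name": "flask-app", "image": "placeholder"}]}
    print(with_new_image(task_def, "flask-app", uri))
```

Tagging images with a unique value such as a commit SHA, rather than reusing a single tag, keeps every task definition revision tied to exactly one build, which is what makes rollbacks predictable.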

Deployment Strategy

  • Type: Rolling update
  • Blue-Green: Via task definition revisions
  • Rollback: Automatic on failure or manual
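
During a rolling update, ECS keeps the running task count between two bounds derived from the service's `minimumHealthyPercent` and `maximumPercent` settings. The sketch below computes those bounds. The article does not state the values used, so the defaults of 100 and 200 here are an assumption, and the function name is illustrative.

```python
def rolling_update_bounds(desired_count, minimum_healthy_percent=100, maximum_percent=200):
    """Return (lowest, highest) task counts allowed while a deployment is in progress.

    The lower bound rounds up and the upper bound rounds down.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    lowest = -(-desired_count * minimum_healthy_percent // 100)  # ceiling division
    highest = desired_count * maximum_percent // 100             # floor division
    return lowest, highest


if __name__ == "__main__":
    print(rolling_update_bounds(2))
    print(rolling_update_bounds(3, 50, 150))
```

With two desired tasks and the assumed 100/200 settings, the scheduler can start two new tasks before stopping the old ones, so capacity never drops during a deployment.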

High Availability Features

Multi-AZ Deployment

  • ECS tasks in 2 availability zones
  • RDS Multi-AZ with automatic failover
  • ALB distributes across 2 AZs

Auto Scaling

  • Horizontal scaling based on CPU
  • Automatic task replacement on failure

Health Checks

  • ALB health checks
  • Container health checks
  • Service-level health checks

Redundancy

  • 2 NAT Gateways (one per AZ)
  • Multiple ECS tasks per service
  • RDS standby replica

Security Best Practices

Network Isolation

  • Private subnets for compute
  • Database-only subnets
  • No direct internet access to compute

Least Privilege

  • Minimal IAM permissions
  • Security group restrictions
  • No public RDS access

Encryption

  • RDS encryption at rest
  • TLS/HTTPS in transit
  • ECR encryption

Performance Characteristics

Response Times

  • Nginx → Flask: < 5ms (Service Connect)
  • Flask → Redis: < 2ms (in-memory)
  • Flask → RDS: < 10ms (same VPC)
  • End-to-End: < 100ms (p95)

Throughput

  • ALB: thousands of req/sec
  • Flask: ~100 req/sec per task
  • Redis: 100k ops/sec
  • RDS: Depends on instance class
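
Combining the per-task Flask figure with the scaling limits of 2 to 4 tasks gives a rough capacity envelope for the API tier. The sketch below does that arithmetic. The figure of about 100 req/sec per task is the estimate quoted above, not a measurement, and the function names are assumptions.

```python
import math

FLASK_RPS_PER_TASK = 100  # approximate figure quoted above
MIN_TASKS, MAX_TASKS = 2, 4


def capacity_range():
    """Approximate request rate the Flask tier handles at minimum and maximum scale."""
    return MIN_TASKS * FLASK_RPS_PER_TASK, MAX_TASKS * FLASK_RPS_PER_TASK


def tasks_needed(requests_per_second):
    """Tasks required for a given load, never below the configured minimum."""
    return max(MIN_TASKS, math.ceil(requests_per_second / FLASK_RPS_PER_TASK))


def within_limits(requests_per_second):
    """True when the load fits inside the current auto scaling maximum."""
    return tasks_needed(requests_per_second) <= MAX_TASKS


if __name__ == "__main__":
    print(capacity_range())
    print(tasks_needed(250), within_limits(250))
    print(tasks_needed(500), within_limits(500))
```

On these numbers the Flask tier, not the ALB or Redis, is the first bottleneck: roughly 400 req/sec at maximum scale.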

Cost Optimization

Current Costs (~$167/month)

  • NAT Gateways: $65 (largest expense)
  • RDS Multi-AZ: $35
  • ECS Fargate: $40
  • ALB: $20
  • Other: $7

Optimization Strategies

  1. Use 1 NAT Gateway for dev (save $32/month)
  2. Use RDS Single-AZ for dev (save $18/month)
  3. Use Fargate Spot (save up to 70% on compute)
  4. Stop non-production environments when not in use
  5. Use Reserved Instances for RDS in production
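
The cost figures above are easy to check and combine. The sketch below totals the monthly breakdown and applies the two development savings with fixed dollar amounts (strategies 1 and 2). The Fargate Spot figure is an upper bound based on the "up to 70%" claim, not an expected saving.

```python
# Monthly cost estimates in USD, from the breakdown above.
MONTHLY_COSTS = {
    "NAT Gateways": 65,
    "RDS Multi-AZ": 35,
    "ECS Fargate": 40,
    "ALB": 20,
    "Other": 7,
}

# Fixed savings for a development environment (strategies 1 and 2).
DEV_SAVINGS = {
    "Single NAT Gateway": 32,
    "RDS Single-AZ": 18,
}

FARGATE_SPOT_MAX_DISCOUNT_PERCENT = 70  # "up to 70%", so an upper bound


def total_monthly_cost():
    return sum(MONTHLY_COSTS.values())


def dev_monthly_cost():
    """Monthly cost after applying the NAT Gateway and Single-AZ savings."""
    return total_monthly_cost() - sum(DEV_SAVINGS.values())


def max_spot_saving():
    """Largest possible saving if all Fargate compute ran on Spot."""
    return MONTHLY_COSTS["ECS Fargate"] * FARGATE_SPOT_MAX_DISCOUNT_PERCENT // 100


if __name__ == "__main__":
    print(total_monthly_cost(), dev_monthly_cost(), max_spot_saving())
```

The first two strategies alone bring a development environment from about $167 to about $117 per month.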

Technology Stack

Layer                   | Technology          | Version
Container Orchestration | AWS ECS Fargate     | Latest
Container Runtime       | Docker              | 24.x
Service Mesh            | ECS Service Connect | -
Application             | Python Flask        | 3.0.0
Reverse Proxy           | Nginx               | 1.25
Database                | PostgreSQL          | 15.4
Cache                   | Redis               | 7.x
IaC                     | Terraform           | 1.6+
CI/CD                   | GitHub Actions      | -
Monitoring              | CloudWatch          | -

What’s Next?

In the upcoming parts of this series, we’ll dive deep into:

  • Part 2: Infrastructure setup with Terraform
  • Part 3: Application containerization with Docker
  • Part 4: ECS deployment and service configuration
  • Part 5: CI/CD pipeline with GitHub Actions
  • Part 6: Monitoring, logging, and troubleshooting
  • Part 7: Security hardening and best practices
  • Part 8: Cost optimization and cleanup

Key Takeaways

This architecture provides:

  • Production-ready microservices on AWS ECS Fargate
  • High availability across multiple AZs
  • Auto-scaling based on demand
  • Complete CI/CD automation
  • Security best practices implemented
  • Cost-optimized for production workloads
  • Infrastructure as Code with Terraform
  • Comprehensive monitoring and logging

This foundation will serve as the basis for building scalable, reliable, and maintainable cloud-native applications. Stay tuned for the next part where we’ll start implementing the infrastructure using Terraform!


Ready to build this architecture? Follow along with the complete series to implement this production-grade ECS microservices solution step by step, starting with the Terraform infrastructure in Part 2.

Questions or feedback? Feel free to reach out in the comments below!
