Mastering the GitOps Dance: Building Automated CI/CD for Infrastructure and Apps

DevOps Engineer with 3 years of experience architecting and operating production-grade, cloud-native environments. Specialized in Kubernetes orchestration and Infrastructure as Code (Terraform, Ansible) to deliver "one-click" deployment solutions. Proven track record in implementing GitOps (Argo CD) and full-stack observability (Prometheus/Grafana), ensuring high availability and system transparency from the networking layer to the application.

Introduction

Have you ever had that sinking feeling when a teammate asks, “Who updated the production server?” and the room goes silent? Or perhaps you’ve experienced the “works on my machine” curse, where a deployment fails because a manual configuration step was missed.

In the fast-paced world of cloud and DevOps engineering, manual steps are the enemy of reliability. This week, I set out to move beyond the manual “click-and-pray” method. My goal was to implement a GitOps workflow, a pattern in which Git is the single “source of truth” for everything. If it isn’t in the code, it doesn’t exist in the infrastructure.

In this project, I built a self-healing pipeline that provisions AWS infrastructure with Terraform, configures a monitoring stack with Ansible, and keeps a three-tier application up to date with GitHub Actions.

The Vision

  • Infrastructure as Code (IaC): Every EC2 instance, security group, and VPC defined in Terraform.

  • GitOps Branching: Using a strict branch hierarchy (infra_features, infra_main, integration, and deployment) to ensure that every change is reviewed and tested before it ever touches production.

  • Automated Monitoring: A full observability stack (Prometheus, Grafana, Loki) that deploys itself as soon as the server is created.

The Infrastructure Pipeline - Planning for the Future and Your Wallet

Infrastructure deployment shouldn’t be a black box. One of the most insightful parts of this workflow was the integration of Infracost.

The Branching Logic

  • infra_features (The Sandbox): This is where the work begins. On every push to this branch, the terraform-validate.yml workflow triggers. It acts as the first line of defence, catching syntax errors and internally inconsistent configuration before any human reviews the code. The terraform-validate.yml workflow is shown below.

      name: "Terraform Validate"
    
      on:
        push:
          branches:
            - infra_features
    
      jobs:
        validate:
          runs-on: ubuntu-latest
          steps:
            - name: Checkout Code
              uses: actions/checkout@v4
    
            - name: Setup Terraform
              uses: hashicorp/setup-terraform@v3
              with:
                terraform_version: 1.5.0
    
            - name: Terraform Init
              # We use -backend=false because validation only checks syntax,
              # it doesn't need to log into your S3 bucket yet.
              run: terraform -chdir=terraform init -backend=false
    
            - name: Terraform Validate
              run: terraform -chdir=terraform validate
    

  • infra_main (The Source of Truth): This branch represents our live environment.

The Cost-Aware Review

The real magic happens during the Pull Request from infra_features to infra_main. This triggers the terraform-plan.yml workflow. Beyond showing which resources would be created, it uses Infracost to post a cost estimate directly as a PR comment.

Even if the report shows $0.00, which often happens when using Free Tier resources like t2.micro or when the PR doesn't change existing costs, having this visibility is a game-changer. It forces a culture of "Cost-Ops," where financial impact is reviewed alongside technical changes.

name: "Terraform Plan & Cost"

on:
  pull_request:
    branches:
      - infra_main

jobs:
  plan-and-cost:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }} 

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0

      - name: Terraform Init
        run: terraform -chdir=terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform Plan
        run: terraform -chdir=terraform plan -out=tfplan
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Export Terraform Plan JSON
        run: terraform -chdir=terraform show -json tfplan > /tmp/tfplan.json

      - name: Generate Infracost JSON
        run: |
          infracost diff --path /tmp/tfplan.json \
                          --format json \
                          --out-file /tmp/infracost.json

      - name: Post Infracost comment
        run: |
          infracost comment github \
            --path /tmp/infracost.json \
            --repo $GITHUB_REPOSITORY \
            --github-token ${{ github.token }} \
            --pull-request ${{ github.event.pull_request.number }} \
            --behavior update
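
If a comment alone is not enough, the same JSON report could also gate the PR. The sketch below is one hypothetical way to do that and is not part of the workflow above: over_budget is a helper name made up for illustration, the limit of 50 (in the report's currency, per month) is arbitrary, and the diffTotalMonthlyCost field reflects my understanding of Infracost's JSON output, so check it against the file your version produces.

```shell
# over_budget DIFF LIMIT: exit 0 when DIFF is strictly greater than LIMIT.
# awk does the comparison because POSIX shell arithmetic is integer-only.
over_budget() {
  awk -v diff="$1" -v limit="$2" 'BEGIN { exit !(diff + 0 > limit + 0) }'
}

# In the workflow, the value would come from the Infracost report, e.g.
# (assumed field name, requires jq on the runner):
#   diff=$(jq -r '.diffTotalMonthlyCost // "0"' /tmp/infracost.json)
#   if over_budget "$diff" 50; then
#     echo "::error::Monthly cost increase of $diff exceeds the limit"
#     exit 1
#   fi

# Local demo with made-up numbers.
for diff in 0 12.50 75.10; do
  if over_budget "$diff" 50; then
    echo "$diff: over budget"
  else
    echo "$diff: ok"
  fi
done
```

With these sample values, only 75.10 is reported as over budget. Whether a hard gate or a comment-only review fits better depends on the team; the comment keeps the decision with the reviewer.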

The Automated Handover: Terraform to Ansible

Once the PR is merged into infra_main, the terraform-apply.yml workflow takes over. It doesn't just provision the EC2 instance; it automatically triggers the Ansible playbook for the deployment of the monitoring stack. This is where Terraform + Ansible integration shines: Terraform builds the "house" (the server), and Ansible immediately moves in the "furniture" (the Prometheus, Grafana, and Loki monitoring stack).

name: "Terraform Apply & Monitoring"

on:
  push:
    branches:
      - infra_main  # Runs ONLY when code is merged into infra_main

jobs:
  apply-and-ansible:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0

      - name: Terraform Init
        run: terraform -chdir=terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform Apply
        run: terraform -chdir=terraform apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      # Prepare for Ansible
      - name: Get EC2 IP
        id: get_ip
        run: |
          cd terraform
          echo "PUBLIC_IP=$(terraform output -raw static_ip)" >> $GITHUB_OUTPUT
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Setup SSH Key
        run: |
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > private_key.pem
          chmod 600 private_key.pem

      - name: Run Ansible Monitoring Setup
        run: |
          echo "Waiting for SSH to be ready..."
          sleep 30 
          export ANSIBLE_HOST_KEY_CHECKING=False
          ansible-playbook -i "${{ steps.get_ip.outputs.PUBLIC_IP }}," \
            -u ubuntu \
            --private-key private_key.pem \
            ansible/playbook.yml \
            --tags "common, monitoring"
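
The fixed sleep 30 works, but a fresh instance can take more or less time than that before it accepts SSH connections. A small polling loop is one alternative. This is a sketch under assumptions: retry is a hypothetical helper, and the attempt count and delay shown in the commented usage are arbitrary.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt fails.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "gave up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# In the workflow this could replace the sleep (PUBLIC_IP as in the step above):
#   retry 20 5 ssh -o BatchMode=yes -o ConnectTimeout=5 \
#     -o StrictHostKeyChecking=no -i private_key.pem "ubuntu@$PUBLIC_IP" true

# Local demo: a command that only succeeds on its third call.
calls=0
flaky() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}
retry 5 0 flaky && echo "succeeded after $calls calls"
```

The demo prints "succeeded after 3 calls". Polling fails fast with a clear message when the server never comes up, instead of letting Ansible fail with a connection error.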

The Application Pipeline - Managing the “Robot” and Syncing Realities

Once the infrastructure was alive and the monitoring stack was standing guard, it was time to deploy the application. For this, I shifted from the infra branches to the application-specific workflow: integration and deployment.

The CI Stage: Building and Automated Tagging

In the integration branch, the focus is pure Continuous Integration. Every push triggers the ci-application.yml workflow. It does the heavy lifting: it builds the Docker images for the frontend and backend, tags them with the unique Git SHA, and pushes them to Docker Hub.

But here is where it gets interesting and where I faced my first real challenge. To maintain a true GitOps flow, the docker-compose.yml file in the app repository needs to reflect the exact version of the images currently in Docker Hub. I implemented a "Robot" step in the CI pipeline that automatically updates the image tags in the compose file and commits that change back to the repository.

Insight from the Trenches: The "Divergent Branch" Headache. Because the GitHub Actions "Robot" was committing changes to the remote integration branch while I was making local code updates, I quickly ran into "failed to push" errors due to divergent histories. This was a fantastic learning moment. I had to adopt a disciplined git pull --rebase habit to ensure my local environment was always synced with the automated updates happening in the cloud. It’s a perfect example of how automation forces you to become a better Git user.
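
To make the divergence concrete, here is a self-contained reproduction in a temporary directory. It is a sketch, not a transcript of my repository: the bare repo stands in for GitHub, the "robot" clone stands in for the Actions runner, and the file contents, commit messages, and the abc1234 tag are made up.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"          # stand-in for GitHub

# Developer clone with an initial commit on the integration branch.
git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/integration
git config user.name "dev"
git config user.email "dev@example.com"
echo "v1" > app.txt
echo "image: exampleuser/backend:old" > docker-compose.yml
git add .
git commit -q -m "feat: initial commit"
git remote add origin "$work/remote.git"
git push -q -u origin integration

# The "robot" updates the image tag and pushes, as the CI job does.
git clone -q -b integration "$work/remote.git" "$work/robot"
cd "$work/robot"
git config user.name "github-actions"
git config user.email "github-actions@github.com"
echo "image: exampleuser/backend:abc1234" > docker-compose.yml
git commit -q -am "chore: update images to abc1234 [skip ci]"
git push -q origin integration

# Meanwhile the developer commits locally without pulling first.
cd "$work/dev"
echo "v2" > app.txt
git commit -q -am "feat: local change"
if ! git push -q origin integration 2>/dev/null; then
  echo "push rejected: remote has commits we do not have"
fi

# Replay the local commit on top of the robot's commit, then push.
git pull -q --rebase origin integration
git push -q origin integration
git log --format=%s
```

The final log lists the local change on top of the robot's commit, with no merge commit in between, which keeps the integration history linear.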

name: "CI - Build and Tag Application"

on:
  push:
    branches:
      - integration

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build and Push Backend
        uses: docker/build-push-action@v5
        with:
          context: ./app/backend
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/backend:${{ github.sha }}

      - name: Build and Push Frontend
        uses: docker/build-push-action@v5
        with:
          context: ./app/frontend
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/frontend:${{ github.sha }}

      - name: Update Image Tags in Docker Compose
        run: |
          # This command finds the 'image:' line and replaces the tag with the unique Git SHA
          sed -i "s|image: .*/backend:.*|image: ${{ secrets.DOCKERHUB_USERNAME }}/backend:${{ github.sha }}|g" app/docker-compose.yml
          sed -i "s|image: .*/frontend:.*|image: ${{ secrets.DOCKERHUB_USERNAME }}/frontend:${{ github.sha }}|g" app/docker-compose.yml

      - name: Commit and Push Changes
        run: |
          git config --global user.name "github-actions"
          git config --global user.email "github-actions@github.com"
          git add app/docker-compose.yml
          git commit -m "chore: update images to ${{ github.sha }} [skip ci]"
          git push origin integration
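
The sed substitution in the "Update Image Tags" step can be tried locally before trusting it in CI. The sketch below applies the same pattern to a sample snippet on stdin instead of editing a file in place; retag is a hypothetical wrapper, and exampleuser and abc1234 are placeholder values.

```shell
# retag SERVICE USER SHA: rewrite the image line for SERVICE read from stdin,
# using the same pattern as the workflow step.
retag() {
  sed "s|image: .*/$1:.*|image: $2/$1:$3|g"
}

compose='  cv-backend:
    image: olduser/backend:oldsha # stale comment
  cv-frontend:
    image: olduser/frontend:oldsha'

printf '%s\n' "$compose" |
  retag backend exampleuser abc1234 |
  retag frontend exampleuser abc1234
```

Note that the trailing `.*` in the pattern matches to the end of the line, so anything after the old tag, including an inline comment, is replaced along with it.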

The updated docker-compose.yml, which serves as the deployment manifest and carries the latest tags, is shown below.

services:
  traefik:
    image: traefik:v3.1
    container_name: traefik
    restart: always
    command:
      - "--api.insecure=true"
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.docker.network=web_net"
      - "--entrypoints.web.address=:80"
    ports:
      - "80:80"
      - "8080:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - web_net

  cv-db:
    image: postgres:14
    container_name: cv-db
    restart: unless-stopped
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: changethis123
      POSTGRES_DB: app
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - app_net

  cv-backend:
    # CI Pipeline will replace this tag
    image: iamay0bami/backend:e3f0aa13135961951248adc885193198ebc08b44 # Updated tag matching DockerHub image tag
    container_name: cv-backend
    restart: always
    environment:
      # We use ${DOMAIN} which will be set on the server during CD
      - DOMAIN=${DOMAIN}
      - ENVIRONMENT=production
      - PROJECT_NAME=Full Stack FastAPI Project
      - STACK_NAME=full-stack-fastapi-project
      - API_V1_STR=/api/v1
      - SECRET_KEY=changethis123
      - FIRST_SUPERUSER=admin@example.com
      - FIRST_SUPERUSER_PASSWORD=changethis123
      - POSTGRES_SERVER=cv-db
      - POSTGRES_PORT=5432
      - POSTGRES_DB=app
      - POSTGRES_USER=app
      - POSTGRES_PASSWORD=changethis123
      # Hardcoding CORS for simplicity or using ${DOMAIN}
      - BACKEND_CORS_ORIGINS=http://${DOMAIN},http://localhost,http://localhost:5173
    # THE CRITICAL DB SEEDING COMMAND
    command: >
      bash -c "PYTHONPATH=/app alembic upgrade head && 
               PYTHONPATH=/app python app/initial_data.py && 
               uvicorn app.main:app --host 0.0.0.0 --port 8000"
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=web_net"
      - "traefik.http.routers.backend.rule=PathPrefix(`/api/v1`) || PathPrefix(`/docs`) || PathPrefix(`/redoc`) || PathPrefix(`/openapi.json`)"
      - "traefik.http.routers.backend.entrypoints=web"
      - "traefik.http.services.backend.loadbalancer.server.port=8000"
    networks:
      - app_net
      - web_net
    depends_on:
      - cv-db

  cv-frontend:
    # CI Pipeline will replace this tag
    image: iamay0bami/frontend:e3f0aa13135961951248adc885193198ebc08b44 # Updated tag matching DockerHub image tag
    container_name: cv-frontend
    restart: always
    environment:
      - VITE_API_URL=http://${DOMAIN}/api
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.frontend.rule=PathPrefix(`/`)"
      - "traefik.http.routers.frontend.entrypoints=web"
      - "traefik.http.routers.frontend.priority=1"
      - "traefik.http.services.frontend.loadbalancer.server.port=80"
    networks:
      - web_net
    depends_on:
      - cv-backend

networks:
  web_net:
    external: true
  app_net:
    driver: bridge

volumes:
  postgres_data:

The CD Stage: Shipping to Production

The final step in the journey is the merge from integration to the deployment branch. This triggers the cd-application.yml workflow.

This workflow is the closer. It doesn't build anything; instead, it connects over SSH to the AWS EC2 instance created by Terraform, copies over the updated docker-compose.yml (the one the Robot just edited), and runs docker compose pull followed by docker compose up -d. Because Traefik discovers containers through their Docker labels, the new versions of the app are picked up and served to the user without me having to touch a single configuration file on the server.

name: "CD - Deploy Application"

on:
  push:
    branches:
      - deployment

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Copy docker-compose to server
        uses: appleboy/scp-action@v0.1.7
        with:
          host: ${{ secrets.SERVER_IP }}
          username: ubuntu
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          source: "app/docker-compose.yml"
          target: "/home/ubuntu/deploy"
          strip_components: 1

      - name: Deploy to Server via SSH
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.SERVER_IP }}
          username: ubuntu
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script_stop: true
          script: |
            cd /home/ubuntu/deploy

            # Export the IP as the DOMAIN so Docker Compose can pick it up
            export DOMAIN=${{ secrets.SERVER_IP }}

            docker network create web_net || true

            # Login to Docker Hub (Required to pull your private/new images)
            echo "${{ secrets.DOCKERHUB_TOKEN }}" | docker login -u "${{ secrets.DOCKERHUB_USERNAME }}" --password-stdin

            # Pull the specific tags defined in the updated docker-compose.yml
            docker compose pull

            # Start the application
            docker compose up -d --remove-orphans

            # Optional: Clean up old unused images to save disk space
            docker image prune -f
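
Since the compose file is the manifest, a cheap sanity check before docker compose up is to confirm that both application images are pinned to the same tag. This is an optional guard I am sketching here, not a step in the workflow above; image_tag and same_release are hypothetical helper names, and the manifest content is a placeholder.

```shell
# image_tag FILE SERVICE: print the tag of the first image whose repository
# path ends in /SERVICE, ignoring any trailing inline comment.
image_tag() {
  sed -n "s|^ *image: .*/$2:\([^ ]*\).*|\1|p" "$1" | head -n 1
}

# same_release FILE: succeed only when backend and frontend share one
# non-empty tag.
same_release() {
  backend_tag=$(image_tag "$1" backend)
  frontend_tag=$(image_tag "$1" frontend)
  [ -n "$backend_tag" ] && [ "$backend_tag" = "$frontend_tag" ]
}

# Demo with a throwaway manifest.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
services:
  cv-backend:
    image: exampleuser/backend:abc1234 # pinned by CI
  cv-frontend:
    image: exampleuser/frontend:abc1234
EOF

if same_release "$manifest"; then
  echo "ok: both images pinned to $(image_tag "$manifest" backend)"
else
  echo "mismatch: backend and frontend tags differ" >&2
fi
```

In the deploy script, a failing same_release could abort before any container is restarted, which guards against a hand-edited manifest slipping through.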

The Relief Moment – Solving the SPA Routing Mystery

No project is complete without a bug hunt. While the deployment was successful, I noticed that reloading the React frontend on any page other than the home screen resulted in a dreaded Nginx 404 error.

This wasn't a pipeline error; it was a routing mismatch. In a Single Page Application (SPA), routes like /login exist only in the client-side router. On a reload, the browser requests /login from the server, and Nginx looks for a physical file at that path, finds none, and returns 404. I had to go back to the drawing board, create a custom nginx.default.conf that falls back to index.html for any path that isn't a real file, and update my Frontend Dockerfile to include this configuration. Seeing the site finally reload perfectly after a fresh pipeline run was the most satisfying moment of the week.

# frontend/Dockerfile
FROM node:18-alpine AS build

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Serve with a lightweight web server
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
COPY nginx.default.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

# nginx.default.conf
server {
  listen 80;
  server_name _;

  root /usr/share/nginx/html;

  # Serve static assets normally
  location ~* \.(?:js|css|png|jpg|jpeg|gif|svg|ico|map|woff2?|woff|ttf)$ {
    try_files $uri =404;
    access_log off;
    expires 1y;
  }

  location /assets/ {
    try_files $uri =404;
    access_log off;
    expires 1y;
  }

  # SPA fallback: for everything else return index.html so client router can handle it
  location / {
    try_files $uri $uri/ /index.html;
  }
}
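
To see why the fallback fixes the reload problem, here is a toy model of the try_files $uri $uri/ /index.html lookup, written as a shell function. It is an illustration only, not how Nginx is implemented: resolve is a hypothetical name, it models just the location / block (not the static-asset blocks), and the site directory is a temporary stand-in for /usr/share/nginx/html.

```shell
# resolve ROOT URI: print the path that the `location /` block would serve.
resolve() {
  root=$1
  uri=$2
  if [ -f "$root$uri" ]; then
    # $uri: a real file exists, serve it as-is.
    printf '%s\n' "$uri"
  elif [ -d "$root$uri" ] && [ -f "$root${uri%/}/index.html" ]; then
    # $uri/: a directory with an index file.
    printf '%s\n' "${uri%/}/index.html"
  else
    # Fallback: hand the request to the SPA entry point.
    printf '%s\n' "/index.html"
  fi
}

site=$(mktemp -d)
echo '<div id="root"></div>' > "$site/index.html"
echo 'User-agent: *' > "$site/robots.txt"

resolve "$site" /robots.txt         # real file
resolve "$site" /                   # directory index
resolve "$site" /login              # client-side route
resolve "$site" /settings/profile   # nested client-side route
```

Only /robots.txt resolves to itself; the other three requests all end up at /index.html, where the client-side router reads the URL and renders the right page.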

The Watchtower – Monitoring, Observability, and Traefik

Deploying an application is one thing; knowing it’s healthy is another. One of the most satisfying parts of this project was the seamless integration of the Monitoring Stack.

The Automated Handover

Remember our infra_main workflow? As soon as Terraform finished provisioning the EC2 instance, it triggered the monitoring_stack playbook.

This playbook didn't just install packages; it orchestrated a sophisticated observability environment. By the time I checked my browser, Prometheus was already scraping metrics, Loki was receiving logs shipped by Promtail, and Grafana was ready to visualize it all.

---
- name: Create monitoring and provisioning directories
  file:
    path: "{{ item }}"
    state: directory
  loop:
    - /home/ubuntu/monitoring
    - /home/ubuntu/monitoring/provisioning/datasources
    - /home/ubuntu/monitoring/provisioning/dashboards
    - /home/ubuntu/monitoring/dashboards

- name: Deploy main configuration files
  template:
    src: "{{ item.src }}"
    dest: "/home/ubuntu/monitoring/{{ item.dest }}"
  loop:
    - { src: 'prometheus.yml.j2', dest: 'prometheus.yml' }
    - { src: 'loki-config.yml.j2', dest: 'loki-config.yml' }
    - { src: 'promtail-config.yml.j2', dest: 'promtail-config.yml' }
    - { src: 'docker-compose.monitoring.yml.j2', dest: 'docker-compose.yml' }

- name: Deploy Grafana provisioning
  template:
    src: "{{ item.src }}"
    dest: "/home/ubuntu/monitoring/provisioning/{{ item.dest }}"
  loop:
    - { src: 'datasources.yml.j2', dest: 'datasources/datasources.yml' }
    - { src: 'dashboards.yml.j2', dest: 'dashboards/dashboards.yml' }


- name: Copy dashboard JSON files
  copy:
    src: dashboards/
    dest: /home/ubuntu/monitoring/dashboards/
    mode: '0644'

- name: Start Monitoring Stack
  community.docker.docker_compose_v2:
    project_src: /home/ubuntu/monitoring
    state: present
    recreate: always
    remove_orphans: yes

The Magic of Path-Based Routing

In my previous setups, I’d have to open a dozen ports (3000 for Grafana, 9090 for Prometheus, etc.) in the AWS Security Group. Not here. Thanks to Traefik, I implemented a production-grade reverse proxy. Everything—my frontend, backend, and monitoring tools—is served over Port 80.

Whether it’s http://<IP>/grafana/ or http://<IP>/prometheus/, Traefik handles the routing logic using Docker labels. This keeps the attack surface small and the URL structure clean.

Conclusion

Reflecting on this project, the true "Aha!" moment wasn't just getting the code to run; it was realizing that I had built a reproducible system. If I were to delete my entire AWS instance right now, I could recreate the entire infrastructure, monitoring, and application with just a few Git commands.

Lessons Learned:

  • Infrastructure is Code: Using terraform-validate.yml and terraform-plan.yml across infra_features and infra_main ensures that your cloud environment is never a mystery.

  • Automation requires Discipline: Dealing with the "Robot" commits in the integration branch taught me more about Git branch management and merge conflicts than any tutorial ever could.

  • Observability is non-negotiable: Having a monitoring stack that deploys with the infrastructure means you have visibility from Day 1 instead of flying blind.

GitOps isn't just about using Git; it's about a cultural shift where every change is auditable, version-controlled, and collaborative. By treating our infrastructure and pipelines with the same respect as our feature code, we build systems that aren't just fast; they’re resilient.