Mastering the GitOps Dance: Building Automated CI/CD for Infrastructure and Apps

DevOps Engineer with 3 years of experience architecting and operating production-grade, cloud-native environments. Specialized in Kubernetes orchestration and Infrastructure as Code (Terraform, Ansible) to deliver "one-click" deployment solutions. Proven track record in implementing GitOps (Argo CD) and full-stack observability (Prometheus/Grafana), ensuring high availability and system transparency from the networking layer to the application.
Introduction
Have you ever had that sinking feeling when a teammate asks, “Who updated the production server?” and the room goes silent? Or perhaps you’ve experienced the “works on my machine” curse, where a deployment fails because a manual configuration step was missed.
In the fast-paced world of cloud/DevOps engineering, manual steps are the enemy of reliability. This week, I set out to move beyond the manual “click-and-pray” method. My goal here was to implement a GitOps workflow, a pattern in which Git is the single “source of truth” for everything. If it isn’t in the code, it doesn’t exist in the infrastructure.
In this project, I built a self-healing pipeline that provisions AWS infrastructure with Terraform, configures a monitoring stack with Ansible, and keeps a three-tier application up to date with GitHub Actions.
The Vision
Infrastructure as Code (IaC): Every EC2 instance, security group, and VPC defined in Terraform.
GitOps Branching: Using a strict branch hierarchy (infra_features, infra_main, integration, and deployment) to ensure that every change is reviewed and tested before it ever touches production.
Automated Monitoring: A full observability stack (Prometheus, Grafana, Loki) that deploys itself as soon as the server is created.

The Infrastructure Pipeline - Planning for the Future and Your Wallet
Infrastructure deployment shouldn’t be a black box. One of the most insightful parts of this workflow was the integration of Infracost.
The Branching Logic
infra_features (The Sandbox): This is where the work begins. On every push to this branch, the terraform-validate.yml workflow triggers. It acts as the first line of defence, ensuring that the configuration is syntactically valid and internally consistent before any human even reviews the code. The terraform-validate.yml workflow is shown below.
name: "Terraform Validate"
on:
  push:
    branches:
      - infra_features
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0
      - name: Terraform Init
        # We use -backend=false because validation only checks syntax,
        # it doesn't need to log into your S3 bucket yet.
        run: terraform -chdir=terraform init -backend=false
      - name: Terraform Validate
        run: terraform -chdir=terraform validate
infra_main (The Source of Truth): This branch represents our live environment.
The Cost-Aware Review
The real magic happens during the Pull Request from infra_features to infra_main. This triggers the terraform-plan.yml workflow. Beyond showing me which resources would be created, it integrates Infracost to post a cost estimate directly as a PR comment.
Even if the report shows $0.00, which often happens when using Free Tier resources like t2.micro or when the PR doesn't change existing costs, having this visibility is a game-changer. It forces a culture of "Cost-Ops," where financial impact is reviewed alongside technical changes.
name: "Terraform Plan & Cost"
on:
  pull_request:
    branches:
      - infra_main
jobs:
  plan-and-cost:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0
      - name: Terraform Init
        run: terraform -chdir=terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Terraform Plan
        run: terraform -chdir=terraform plan -out=tfplan
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Export Terraform Plan JSON
        run: terraform -chdir=terraform show -json tfplan > /tmp/tfplan.json
      - name: Generate Infracost JSON
        run: |
          infracost diff --path /tmp/tfplan.json \
            --format json \
            --out-file /tmp/infracost.json
      - name: Post Infracost comment
        run: |
          infracost comment github \
            --path /tmp/infracost.json \
            --repo $GITHUB_REPOSITORY \
            --github-token ${{ github.token }} \
            --pull-request ${{ github.event.pull_request.number }} \
            --behavior update
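The cost review could also be enforced rather than only displayed. The following is a minimal Python sketch, not part of the original pipeline, that reads the JSON written by the "Generate Infracost JSON" step and reports whether the monthly cost increase stays within a budget. It assumes the report exposes a top-level diffTotalMonthlyCost string field; the function names, the sample report, and the 20 USD threshold are all hypothetical.

```python
import json
from decimal import Decimal


def monthly_cost_delta(report: dict) -> Decimal:
    """Return the monthly cost change recorded in an Infracost JSON report.

    Assumption: the report has a top-level "diffTotalMonthlyCost" string.
    A missing or null value is treated as "no change".
    """
    raw = report.get("diffTotalMonthlyCost") or "0"
    return Decimal(str(raw))


def within_budget(report: dict, max_increase: Decimal) -> bool:
    """True when the PR raises the monthly cost by no more than max_increase."""
    return monthly_cost_delta(report) <= max_increase


# Hypothetical report fragment. In CI the report would instead be loaded with:
#   with open("/tmp/infracost.json") as fh:
#       report = json.load(fh)
example_report = json.loads('{"currency": "USD", "diffTotalMonthlyCost": "8.47"}')
budget = Decimal("20")  # hypothetical per-PR monthly budget

if within_budget(example_report, budget):
    print("Cost increase is within budget")
else:
    print("Cost increase exceeds budget")
```

Using Decimal instead of float avoids rounding surprises when comparing currency amounts. A workflow step could exit non-zero on the "exceeds budget" branch to block the merge.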




The Automated Handover: Terraform to Ansible
Once the PR is merged into infra_main, the terraform-apply.yml workflow takes over. It doesn't just provision the EC2 instance; it automatically triggers the Ansible playbook for the deployment of the monitoring stack. This is where Terraform + Ansible integration shines: Terraform builds the "house" (the server), and Ansible immediately moves in the "furniture" (the Prometheus, Grafana, and Loki monitoring stack).
name: "Terraform Apply & Monitoring"
on:
  push:
    branches:
      - infra_main # Runs ONLY when code is merged into infra_main
jobs:
  apply-and-ansible:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.0
      - name: Terraform Init
        run: terraform -chdir=terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Terraform Apply
        run: terraform -chdir=terraform apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      # Prepare for Ansible
      - name: Get EC2 IP
        id: get_ip
        run: |
          cd terraform
          echo "PUBLIC_IP=$(terraform output -raw static_ip)" >> $GITHUB_OUTPUT
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - name: Setup SSH Key
        run: |
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > private_key.pem
          chmod 600 private_key.pem
      - name: Run Ansible Monitoring Setup
        run: |
          echo "Waiting for SSH to be ready..."
          sleep 30
          export ANSIBLE_HOST_KEY_CHECKING=False
          ansible-playbook -i "${{ steps.get_ip.outputs.PUBLIC_IP }}," \
            -u ubuntu \
            --private-key private_key.pem \
            ansible/playbook.yml \
            --tags "common, monitoring"
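The fixed sleep 30 in the workflow above is a guess about how long a fresh instance needs before SSH accepts connections. One alternative is to poll the port until it answers. The following is a minimal Python sketch of that idea, not part of the original workflow; the function name and the timeout values are hypothetical, and the demo runs against a throwaway local listener rather than a real EC2 instance.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A completed TCP handshake means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            # Never sleep past the deadline
            time.sleep(min(interval, remaining))


# Demo against a local listener on an OS-assigned port. In the pipeline the
# host would be the instance's public IP and the port would be 22.
with socket.socket() as demo_server:
    demo_server.bind(("127.0.0.1", 0))
    demo_server.listen()
    demo_host, demo_port = demo_server.getsockname()
    print(wait_for_port(demo_host, demo_port, timeout=5, interval=1))
```

Polling returns as soon as the port is reachable and fails loudly if the instance never comes up, whereas a fixed sleep can be both too long and too short.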

The Application Pipeline - Managing the “Robot” and Syncing Realities
Once the infrastructure was alive and the monitoring stack was standing guard, it was time to deploy the application. For this, I shifted from the infra branches to the application-specific workflow: integration and deployment.
The CI Stage: Building and Automated Tagging
In the integration branch, the focus is pure Continuous Integration. Every push triggers the ci-application.yml. This script does the heavy lifting: it builds the Docker images for the frontend and backend, tags them with the unique Git SHA, and pushes them to Docker Hub.
But here is where it gets interesting and where I faced my first real challenge. To maintain a true GitOps flow, the docker-compose.yml file in the app repository needs to reflect the exact version of the images currently in Docker Hub. I implemented a "Robot" step in the CI pipeline that automatically updates the image tags in the compose file and commits that change back to the repository.
Insight from the Trenches: The "Divergent Branch" Headache. Because the GitHub Actions "Robot" was committing changes to the remote integration branch while I was making local code updates, I quickly ran into "failed to push" errors due to divergent histories. This was a fantastic learning moment. I had to adopt a disciplined git pull --rebase habit to ensure my local environment was always synced with the automated updates happening in the cloud. It’s a perfect example of how automation forces you to become a better Git user.
name: "CI - Build and Tag Application"
on:
  push:
    branches:
      - integration
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Login to Docker Hub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - name: Build and Push Backend
        uses: docker/build-push-action@v5
        with:
          context: ./app/backend
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/backend:${{ github.sha }}
      - name: Build and Push Frontend
        uses: docker/build-push-action@v5
        with:
          context: ./app/frontend
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/frontend:${{ github.sha }}
      - name: Update Image Tags in Docker Compose
        run: |
          # This command finds the 'image:' line and replaces the tag with the unique Git SHA
          sed -i "s|image: .*/backend:.*|image: ${{ secrets.DOCKERHUB_USERNAME }}/backend:${{ github.sha }}|g" app/docker-compose.yml
          sed -i "s|image: .*/frontend:.*|image: ${{ secrets.DOCKERHUB_USERNAME }}/frontend:${{ github.sha }}|g" app/docker-compose.yml
      - name: Commit and Push Changes
        run: |
          git config --global user.name "github-actions"
          git config --global user.email "github-actions@github.com"
          git add app/docker-compose.yml
          git commit -m "chore: update images to ${{ github.sha }} [skip ci]"
          git push origin integration
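To make the "Robot" step easier to reason about, here is a minimal Python sketch that mirrors the two sed substitutions from the workflow above on an in-memory compose fragment. It is an illustration only, not what the pipeline runs; the function name, the exampleuser account, and the abc123 tag are hypothetical.

```python
import re


def update_image_tag(compose_text: str, service: str, user: str, tag: str) -> str:
    """Mirror of: sed "s|image: .*/<service>:.*|image: <user>/<service>:<tag>|g"."""
    pattern = rf"image: .*/{re.escape(service)}:.*"
    # '.' does not match newlines, so each match stays on one line, as with sed.
    # A function replacement keeps backslashes in the inputs literal.
    return re.sub(pattern, lambda _match: f"image: {user}/{service}:{tag}", compose_text)


sample = """\
services:
  cv-db:
    image: postgres:14
  cv-backend:
    image: exampleuser/backend:oldsha
  cv-frontend:
    image: exampleuser/frontend:oldsha
"""

updated = update_image_tag(sample, "backend", "exampleuser", "abc123")
updated = update_image_tag(updated, "frontend", "exampleuser", "abc123")
print(updated)
```

Because the pattern ends in `.*`, everything after the old tag on the same line is replaced too, including any trailing comment. Images without a `/backend:` or `/frontend:` segment, such as postgres:14, are left untouched.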


The updated docker-compose.yml, which carries the latest tags and serves as the deployment manifest, is shown below.
services:
  traefik:
    image: traefik:v3.1
    container_name: traefik
    restart: always
    command:
      - "--api.insecure=true"
      - "--api.dashboard=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.docker.network=web_net"
      - "--entrypoints.web.address=:80"
    ports:
      - "80:80"
      - "8080:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - web_net
  cv-db:
    image: postgres:14
    container_name: cv-db
    restart: unless-stopped
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: changethis123
      POSTGRES_DB: app
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - app_net
  cv-backend:
    # CI Pipeline will replace this tag
    image: iamay0bami/backend:e3f0aa13135961951248adc885193198ebc08b44 # Updated tag matching DockerHub image tag
    container_name: cv-backend
    restart: always
    environment:
      # We use ${DOMAIN} which will be set on the server during CD
      - DOMAIN=${DOMAIN}
      - ENVIRONMENT=production
      - PROJECT_NAME=Full Stack FastAPI Project
      - STACK_NAME=full-stack-fastapi-project
      - API_V1_STR=/api/v1
      - SECRET_KEY=changethis123
      - FIRST_SUPERUSER=admin@example.com
      - FIRST_SUPERUSER_PASSWORD=changethis123
      - POSTGRES_SERVER=cv-db
      - POSTGRES_PORT=5432
      - POSTGRES_DB=app
      - POSTGRES_USER=app
      - POSTGRES_PASSWORD=changethis123
      # Hardcoding CORS for simplicity or using ${DOMAIN}
      - BACKEND_CORS_ORIGINS=http://${DOMAIN},http://localhost,http://localhost:5173
    # THE CRITICAL DB SEEDING COMMAND
    command: >
      bash -c "PYTHONPATH=/app alembic upgrade head &&
      PYTHONPATH=/app python app/initial_data.py &&
      uvicorn app.main:app --host 0.0.0.0 --port 8000"
    labels:
      - "traefik.enable=true"
      - "traefik.docker.network=web_net"
      - "traefik.http.routers.backend.rule=PathPrefix(`/api/v1`) || PathPrefix(`/docs`) || PathPrefix(`/redoc`) || PathPrefix(`/openapi.json`)"
      - "traefik.http.routers.backend.entrypoints=web"
      - "traefik.http.services.backend.loadbalancer.server.port=8000"
    networks:
      - app_net
      - web_net
    depends_on:
      - cv-db
  cv-frontend:
    # CI Pipeline will replace this tag
    image: iamay0bami/frontend:e3f0aa13135961951248adc885193198ebc08b44 # Updated tag matching DockerHub image tag
    container_name: cv-frontend
    restart: always
    environment:
      - VITE_API_URL=http://${DOMAIN}/api
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.frontend.rule=PathPrefix(`/`)"
      - "traefik.http.routers.frontend.entrypoints=web"
      - "traefik.http.routers.frontend.priority=1"
      - "traefik.http.services.frontend.loadbalancer.server.port=80"
    networks:
      - web_net
    depends_on:
      - cv-backend
networks:
  web_net:
    external: true
  app_net:
    driver: bridge
volumes:
  postgres_data:
The CD Stage: Shipping to Production
The final step in the journey is the merge from integration to the deployment branch. This triggers the cd-application.yml workflow.
This script is the closer. It doesn't build anything; instead, it uses SSH to securely connect to the AWS EC2 instance created by Terraform. It pulls the updated docker-compose.yml (the one the Robot just edited) and executes a docker compose up -d. Because we are using Traefik as our reverse proxy, the new versions of the app are picked up instantly and served to the user without me having to touch a single configuration file on the server.
name: "CD - Deploy Application"
on:
  push:
    branches:
      - deployment
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: Copy docker-compose to server
        uses: appleboy/scp-action@v0.1.7
        with:
          host: ${{ secrets.SERVER_IP }}
          username: ubuntu
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          source: "app/docker-compose.yml"
          target: "/home/ubuntu/deploy"
          strip_components: 1
      - name: Deploy to Server via SSH
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.SERVER_IP }}
          username: ubuntu
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script_stop: true
          script: |
            cd /home/ubuntu/deploy
            # Export the IP as the DOMAIN so Docker Compose can pick it up
            export DOMAIN=${{ secrets.SERVER_IP }}
            docker network create web_net || true
            # Login to Docker Hub (Required to pull your private/new images)
            echo "${{ secrets.DOCKERHUB_TOKEN }}" | docker login -u "${{ secrets.DOCKERHUB_USERNAME }}" --password-stdin
            # Pull the specific tags defined in the updated docker-compose.yml
            docker compose pull
            # Start the application
            docker compose up -d --remove-orphans
            # Optional: Clean up old unused images to save disk space
            docker image prune -f


The Relief Moment – Solving the SPA Routing Mystery
No project is complete without a bug hunt. While the deployment was successful, I noticed that reloading the React frontend on any page other than the home screen resulted in a dreaded Nginx 404 error.
This wasn't a pipeline error; it was a routing mismatch. In a Single Page Application (SPA), a route like /login exists only in the client-side router, so when the browser reloads that URL, Nginx looks for a physical file at /login and finds none. I had to go back to the drawing board, create a custom nginx.default.conf that falls back to index.html for any path that isn't a real file, and update my frontend Dockerfile to include this configuration. Seeing the site finally reload perfectly after a fresh pipeline run was the most satisfying moment of the week.
# frontend/Dockerfile
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
# Serve with a lightweight web server
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
COPY nginx.default.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
# frontend/nginx.default.conf
server {
    listen 80;
    server_name _;
    root /usr/share/nginx/html;
    # Serve static assets normally
    location ~* \.(?:js|css|png|jpg|jpeg|gif|svg|ico|map|woff2?|woff|ttf)$ {
        try_files $uri =404;
        access_log off;
        expires 1y;
    }
    location /assets/ {
        try_files $uri =404;
        access_log off;
        expires 1y;
    }
    # SPA fallback: for everything else return index.html so client router can handle it
    location / {
        try_files $uri $uri/ /index.html;
    }
}
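To see why the fallback in the configuration above fixes the reload problem, the following Python sketch models how the three location blocks decide what to serve. It is a deliberately simplified model of Nginx's behaviour, not a reimplementation: the resolve function and the sample file set are hypothetical, and the `$uri/` step is approximated as "a directory containing index.html".

```python
import re

# Same extensions as the regex location in the config
STATIC_ASSET = re.compile(
    r"\.(?:js|css|png|jpg|jpeg|gif|svg|ico|map|woff2?|ttf)$", re.IGNORECASE
)


def resolve(uri: str, files: set) -> str:
    """Return the path that would be served for uri, or "404".

    files is the set of paths that exist under /usr/share/nginx/html.
    """
    if STATIC_ASSET.search(uri) or uri.startswith("/assets/"):
        # try_files $uri =404;  (a missing asset is a real 404)
        return uri if uri in files else "404"
    # try_files $uri $uri/ /index.html;
    if uri in files:
        return uri
    directory_index = uri.rstrip("/") + "/index.html"
    if directory_index in files:
        return directory_index
    # SPA fallback: hand the request to the client-side router
    return "/index.html"


built_site = {"/index.html", "/assets/app.js"}  # hypothetical build output
for path in ["/", "/login", "/assets/app.js", "/assets/missing.js"]:
    print(path, "->", resolve(path, built_site))
```

The key design choice is that only non-asset paths fall back to index.html. A missing script or image still returns 404, so a broken asset link is not silently answered with HTML.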


The Watchtower – Monitoring, Observability, and Traefik
Deploying an application is one thing; knowing it’s healthy is another. One of the most satisfying parts of this project was the seamless integration of the Monitoring Stack.
The Automated Handover
Remember our infra_main workflow? As soon as Terraform finished provisioning the EC2 instance, it triggered the monitoring_stack playbook.
This script didn't just install packages; it orchestrated a sophisticated observability environment. By the time I checked my browser, Prometheus was already scraping metrics, Loki was indexing logs via Promtail, and Grafana was ready to visualize it all.
---
- name: Create monitoring and provisioning directories
  file:
    path: "{{ item }}"
    state: directory
  loop:
    - /home/ubuntu/monitoring
    - /home/ubuntu/monitoring/provisioning/datasources
    - /home/ubuntu/monitoring/provisioning/dashboards
    - /home/ubuntu/monitoring/dashboards
- name: Deploy main configuration files
  template:
    src: "{{ item.src }}"
    dest: "/home/ubuntu/monitoring/{{ item.dest }}"
  loop:
    - { src: 'prometheus.yml.j2', dest: 'prometheus.yml' }
    - { src: 'loki-config.yml.j2', dest: 'loki-config.yml' }
    - { src: 'promtail-config.yml.j2', dest: 'promtail-config.yml' }
    - { src: 'docker-compose.monitoring.yml.j2', dest: 'docker-compose.yml' }
- name: Deploy Grafana provisioning
  template:
    src: "{{ item.src }}"
    dest: "/home/ubuntu/monitoring/provisioning/{{ item.dest }}"
  loop:
    - { src: 'datasources.yml.j2', dest: 'datasources/datasources.yml' }
    - { src: 'dashboards.yml.j2', dest: 'dashboards/dashboards.yml' }
- name: Copy dashboard JSON files
  copy:
    src: dashboards/
    dest: /home/ubuntu/monitoring/dashboards/
    mode: '0644'
- name: Start Monitoring Stack
  community.docker.docker_compose_v2:
    project_src: /home/ubuntu/monitoring
    state: present
    recreate: always
    remove_orphans: yes
The Magic of Path-Based Routing
In my previous setups, I’d have to open a dozen ports (3000 for Grafana, 9090 for Prometheus, etc.) in the AWS Security Group. Not here. Thanks to Traefik, I implemented a production-grade reverse proxy. Everything—my frontend, backend, and monitoring tools—is served over Port 80.
Whether it’s http://<IP>/grafana/ or http://<IP>/prometheus/, Traefik handles the routing logic using Docker labels. This keeps the attack surface small and the URL structure clean.
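The routing decision can be sketched in a few lines of Python. This is a simplified model rather than Traefik's implementation: among the routers whose rule matches the request path, the one with the highest priority wins, and a router without an explicit priority uses the length of its rule text. The backend and frontend rules are copied from the docker-compose.yml labels; the grafana and prometheus rules are assumptions, since the monitoring stack's labels are not shown in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    rule: str            # rule text as written in the Docker label
    prefixes: List[str]  # path prefixes named by the rule
    priority: int = 0    # 0 means "not set": fall back to the rule length

    def effective_priority(self) -> int:
        return self.priority or len(self.rule)

    def matches(self, path: str) -> bool:
        return any(path.startswith(prefix) for prefix in self.prefixes)


ROUTERS = [
    Router(
        "backend",
        "PathPrefix(`/api/v1`) || PathPrefix(`/docs`) || PathPrefix(`/redoc`) || PathPrefix(`/openapi.json`)",
        ["/api/v1", "/docs", "/redoc", "/openapi.json"],
    ),
    # Explicit priority 1 makes the frontend the catch-all of last resort
    Router("frontend", "PathPrefix(`/`)", ["/"], priority=1),
    # Assumed rules: the real monitoring labels are not part of this article
    Router("grafana", "PathPrefix(`/grafana`)", ["/grafana"]),
    Router("prometheus", "PathPrefix(`/prometheus`)", ["/prometheus"]),
]


def route(path: str, routers: List[Router]) -> Optional[str]:
    """Return the name of the router that would handle path, if any."""
    candidates = [r for r in routers if r.matches(path)]
    if not candidates:
        return None
    return max(candidates, key=Router.effective_priority).name


for request_path in ["/api/v1/users", "/grafana/", "/login"]:
    print(request_path, "->", route(request_path, ROUTERS))
```

This is why the frontend label sets priority=1: its PathPrefix(`/`) rule matches every request, so it has to lose to any more specific router.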




Conclusion
Reflecting on this project, the true "Aha!" moment wasn't just getting the code to run; it was realizing that I had built a reproducible system. If I were to delete my entire AWS instance right now, I could recreate the entire infrastructure, monitoring, and application with just a few Git commands.
Lessons Learned:
Infrastructure is Code: Using terraform-validate.yml and terraform-plan.yml across infra_features and infra_main ensures that your cloud environment is never a mystery.
Automation requires Discipline: Dealing with the "Robot" commits in the integration branch taught me more about Git branch management and merge conflicts than any tutorial ever could.
Observability is non-negotiable: Having a monitoring stack that deploys with the infrastructure ensures you are never flying blind from Day 1.
GitOps isn't just about using Git; it's about a cultural shift where every change is auditable, version-controlled, and collaborative. By treating our infrastructure and pipelines with the same respect as our feature code, we build systems that aren't just fast; they’re resilient.


