Deployment & Infrastructure - 2/2
From Infrastructure Foundations to Production Excellence
You’ve mastered professional deployment strategies: rolling updates, blue-green deployments, and canary releases that handle traffic gracefully. You’ve established cloud-native infrastructure on AWS, GCP, and Azure with proper networking and security, implemented Infrastructure as Code with Terraform for consistent, version-controlled provisioning, and set up load balancers, CDNs, and server management that scale horizontally. Your infrastructure now runs as an enterprise-grade system that deploys reliably and scales automatically. But here’s the production reality that separates functional infrastructure from world-class operations: perfect infrastructure means nothing if your deployment process requires manual intervention, your monitoring can’t detect issues before customers notice, you have no disaster recovery plan for when things go catastrophically wrong, and your team operates without the CI/CD automation that enables confident deployments dozens of times per day.
The production operations nightmare that destroys scalable businesses:
# Your operations horror story
# CEO: "We need to deploy the critical bug fix NOW, customers are churning"
# Attempt 1: Manual deployment at 2 AM
$ ssh production-server
production$ git pull origin main
# Merge conflict in critical configuration file
# No automated tests, deploying blind
$ sudo systemctl restart myapp
Job for myapp.service failed because the control process exited with error code.
# Service won't start, no clear error logs
$ sudo journalctl -u myapp
# 50,000 lines of generic logs, needle in haystack
# No structured logging, no error aggregation
# Attempt 2: Emergency rollback
$ git log --oneline
# 47 commits since last known good state
# No release tags, no deployment tracking
# Which commit was actually deployed last?
$ git checkout HEAD~5
$ sudo systemctl restart myapp
# Service starts but database migrations are incompatible
# Data corruption in production database
# Attempt 3: Infrastructure disaster
# Primary database server dies during peak traffic
$ aws rds describe-db-instances --db-instance-identifier prod-db
{
"DBInstanceStatus": "failed"
}
# No automated failover, no backups tested in 6 months
# Customer data potentially lost forever
# Attempt 4: Monitoring blindness
# Load balancer shows 500 errors for 30 minutes
# First alert comes from angry customer on Twitter
# No proactive monitoring, no alerting
# "Why didn't anyone tell us the site was down?"
# The cascading operations disasters:
# - No CI/CD pipeline, deployments via "git pull and pray"
# - No automated testing, bugs discovered by customers
# - No monitoring, outages discovered via social media
# - No logging strategy, debugging takes hours
# - No disaster recovery, single points of failure everywhere
# - No backup strategy, data loss risk on every failure
# - No change tracking, impossible to identify what broke
# Result: 8-hour outage during Black Friday
# $2M in lost revenue, 30% customer churn
# Engineering team working 16-hour days for weeks
# Company reputation destroyed, acquisition talks canceled
# The brutal truth: Great infrastructure can't save amateur operations
The uncomfortable production truth: Perfect infrastructure and deployment strategies can’t save you from operational disasters when your CI/CD pipeline is non-existent, monitoring is reactive instead of proactive, disaster recovery is untested, and your team is debugging production issues instead of preventing them. Professional operations requires thinking beyond infrastructure to the entire development lifecycle.
Real-world operations failure consequences:
// What happens when operations practices are amateur:
const operationsFailureImpact = {
deploymentDisasters: {
problem: "Critical bug fix deployment breaks entire application",
cause: "No CI/CD pipeline, no automated testing, manual deployments",
impact: "6-hour outage during peak business hours, revenue loss",
cost: "$500K in lost sales, 20% customer churn",
},
monitoringBlindness: {
problem: "Performance degradation goes unnoticed for hours",
cause: "No proactive monitoring, no alerting, reactive debugging",
impact: "Customers experience slow site, competitors gain market share",
consequences: "Brand reputation damaged, customer satisfaction plummets",
},
disasterRecoveryFailure: {
problem: "Database corruption during peak season with no recovery plan",
cause:
"Untested backups, no disaster recovery procedures, single points of failure",
impact: "Complete data loss, business operations halt for days",
reality: "Company closes permanently, all customer data lost forever",
},
operationalChaos: {
problem:
"Teams spend 80% of time firefighting instead of building features",
cause: "No automation, no monitoring, no proper deployment processes",
impact: "Product development stagnates, competitors outpace innovation",
prevention:
"Professional operations enable teams to focus on value creation",
},
// Perfect infrastructure is worthless when operations
// lack automation, monitoring, disaster recovery, and reliability practices
};
Production operations mastery requires understanding:
- CI/CD pipelines that automate the entire software delivery lifecycle with comprehensive testing and deployment automation
- Advanced Infrastructure as Code that manages complex environments with modules, state management, and collaborative workflows
- Monitoring and alerting that proactively detects issues and provides actionable insights before customers are affected
- Log aggregation and analysis that enables rapid debugging and system understanding through structured, searchable data
- Disaster recovery and backups that ensure business continuity with tested, automated recovery procedures
This article transforms your operations from manual, reactive processes into automated, proactive systems that enable reliable, fast, and confident software delivery at enterprise scale.
CI/CD Pipelines: From Manual Deployments to Automated Excellence
The Evolution from Git Push to Production Excellence
Understanding why manual deployments kill productivity and reliability:
// Manual deployment vs Professional CI/CD pipeline comparison
const deploymentEvolution = {
manualDeployment: {
process: "Developer manually deploys from their laptop",
testing: "Maybe run a few tests locally if remembered",
consistency:
"Every deployment is different, configuration drift guaranteed",
rollback: "Panic-driven git revert, usually makes things worse",
visibility: "No one knows what's deployed or when",
scalability: "One person becomes deployment bottleneck",
quality: "Bugs discovered by customers in production",
reliability: "50% chance of deployment causing outage",
},
professionalCICD: {
process: "Automated pipeline triggered by git events",
testing: "Comprehensive automated test suite runs every time",
consistency: "Identical deployment process every single time",
rollback: "One-click automated rollback to any previous version",
visibility: "Full deployment history and current state tracking",
scalability: "Multiple teams deploying dozens of times per day",
quality: "Issues caught in CI/CD before reaching production",
reliability: "99.9% successful deployments, predictable outcomes",
},
theTransformationImpact: [
"Teams deploy 10x more frequently with higher confidence",
"Bug detection shifts left, issues caught in minutes not hours",
"Rollbacks happen in seconds, not emergency all-hands meetings",
"Developers focus on features, not deployment firefighting",
"Quality improves dramatically with automated testing",
"Infrastructure changes become routine, not risky events",
],
};
GitHub Actions CI/CD Pipeline Implementation:
# .github/workflows/production-deployment.yml - Professional CI/CD pipeline
name: Production Deployment Pipeline
on:
push:
branches: [main]
pull_request:
branches: [main]
workflow_dispatch:
inputs:
environment:
description: "Deployment environment"
required: true
default: "staging"
type: choice
options:
- staging
- production
env:
NODE_VERSION: "18"
DOCKER_REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
# ========================================
# Code Quality and Security Analysis
# ========================================
code-quality:
name: Code Quality Analysis
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0 # Needed for SonarCloud
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: "npm"
- name: Install dependencies
run: |
npm ci --prefer-offline --no-audit
- name: Run ESLint with annotations
run: |
npx eslint . --format=@microsoft/eslint-formatter-sarif --output-file eslint-results.sarif
npx eslint . --format=stylish
continue-on-error: true
- name: Upload ESLint results to GitHub
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: eslint-results.sarif
- name: Run Prettier check
run: npx prettier --check .
- name: TypeScript type checking
run: npx tsc --noEmit
- name: Security audit
run: |
npm audit --audit-level=high
npx audit-ci --config .auditrc.json
- name: License compliance check
run: |
npx license-checker --onlyAllow "MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC" --excludePrivatePackages
- name: SonarCloud Scan
uses: SonarSource/sonarcloud-github-action@master
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
# ========================================
# Comprehensive Testing Suite
# ========================================
unit-tests:
name: Unit Tests
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: "npm"
- name: Install dependencies
run: npm ci --prefer-offline --no-audit
- name: Run unit tests with coverage
run: |
npm run test:unit -- --coverage --watchAll=false --ci
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
token: ${{ secrets.CODECOV_TOKEN }}
file: ./coverage/lcov.info
flags: unittests
name: codecov-umbrella
- name: Comment coverage on PR
if: github.event_name == 'pull_request'
uses: codecov/codecov-action@v3
integration-tests:
name: Integration Tests
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15
env:
POSTGRES_DB: testdb
POSTGRES_USER: testuser
POSTGRES_PASSWORD: testpass
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 5432:5432
redis:
image: redis:7-alpine
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 6379:6379
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: "npm"
- name: Install dependencies
run: npm ci --prefer-offline --no-audit
- name: Wait for services to be ready
run: |
timeout 60 bash -c 'until nc -z localhost 5432; do sleep 1; done'
timeout 60 bash -c 'until nc -z localhost 6379; do sleep 1; done'
- name: Run database migrations
run: npm run db:migrate
env:
DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
REDIS_URL: redis://localhost:6379
- name: Run integration tests
run: npm run test:integration
env:
NODE_ENV: test
DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
REDIS_URL: redis://localhost:6379
e2e-tests:
name: End-to-End Tests
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: ${{ env.NODE_VERSION }}
cache: "npm"
- name: Install dependencies
run: npm ci --prefer-offline --no-audit
- name: Build application
run: npm run build
- name: Start application for E2E tests
run: |
npm start &
timeout 60 bash -c 'until curl -f http://localhost:3000/health; do sleep 2; done'
env:
NODE_ENV: test
PORT: 3000
- name: Install Playwright browsers
run: npx playwright install --with-deps
- name: Run Playwright E2E tests
run: npx playwright test
env:
BASE_URL: http://localhost:3000
- name: Upload Playwright report
uses: actions/upload-artifact@v3
if: always()
with:
name: playwright-report
path: playwright-report/
# ========================================
# Container Security and Building
# ========================================
container-security:
name: Container Security Scan
runs-on: ubuntu-latest
needs: [code-quality]
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Build Docker image for scanning
run: |
docker build -t security-scan:latest .
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
image-ref: "security-scan:latest"
format: "sarif"
output: "trivy-results.sarif"
- name: Upload Trivy scan results to GitHub Security
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: "trivy-results.sarif"
- name: Run Hadolint Dockerfile linting
uses: hadolint/hadolint-action@v3.1.0
with:
dockerfile: Dockerfile
format: sarif
output-file: hadolint-results.sarif
- name: Upload Hadolint results
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: hadolint-results.sarif
build-and-push:
name: Build and Push Container
runs-on: ubuntu-latest
needs: [unit-tests, integration-tests, container-security]
if: github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch'
outputs:
image-digest: ${{ steps.build.outputs.digest }}
image-tag: ${{ steps.meta.outputs.tags }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.DOCKER_REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=ref,event=branch
type=sha,prefix={{branch}}-
type=raw,value=latest,enable={{is_default_branch}}
- name: Build and push Docker image
id: build
uses: docker/build-push-action@v5
with:
context: .
platforms: linux/amd64,linux/arm64
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
build-args: |
BUILDTIME=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.created'] }}
VERSION=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.version'] }}
REVISION=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.revision'] }}
# ========================================
# Staging Deployment and Testing
# ========================================
deploy-staging:
name: Deploy to Staging
runs-on: ubuntu-latest
needs: [build-and-push]
if: github.ref == 'refs/heads/main'
environment:
name: staging
url: https://staging.myapp.com
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-west-2
- name: Setup Kubernetes config
run: |
aws eks update-kubeconfig --name myapp-staging-cluster --region us-west-2
- name: Deploy to staging with Helm
run: |
helm upgrade --install myapp-staging ./helm/myapp \
--namespace staging \
--create-namespace \
--set image.repository="${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_NAME }}" \
--set image.tag="${{ github.sha }}" \
--set environment="staging" \
--set ingress.host="staging.myapp.com" \
--wait --timeout=10m
- name: Wait for deployment to be ready
run: |
kubectl rollout status deployment/myapp-staging -n staging --timeout=600s
- name: Install smoke test dependencies
run: npm ci --prefer-offline --no-audit
- name: Run smoke tests against staging
run: |
timeout 300 bash -c 'until curl -f https://staging.myapp.com/health; do sleep 10; done'
npm run test:smoke -- --baseURL=https://staging.myapp.com
- name: Run load tests against staging
run: |
npm run test:load -- --baseURL=https://staging.myapp.com
# ========================================
# Production Deployment with Approval
# ========================================
deploy-production:
name: Deploy to Production
runs-on: ubuntu-latest
needs: [deploy-staging, e2e-tests]
if: github.ref == 'refs/heads/main' || (github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'production')
environment:
name: production
url: https://myapp.com
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_PROD_ROLE_ARN }}
aws-region: us-west-2
- name: Setup Kubernetes config
run: |
aws eks update-kubeconfig --name myapp-production-cluster --region us-west-2
- name: Pre-deployment health check
run: |
kubectl get nodes
kubectl get pods -A | grep -E "(Crash|Error|ImagePull)" && exit 1 || true
- name: Deploy to production with blue-green strategy
run: |
# Deploy to inactive environment (green if blue is active)
ACTIVE_ENV=$(kubectl get service myapp-active -o jsonpath='{.spec.selector.version}' || echo "blue")
TARGET_ENV=$([ "$ACTIVE_ENV" = "blue" ] && echo "green" || echo "blue")
echo "Deploying to $TARGET_ENV environment (current active: $ACTIVE_ENV)"
helm upgrade --install myapp-$TARGET_ENV ./helm/myapp \
--namespace production \
--create-namespace \
--set image.repository="${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_NAME }}" \
--set image.tag="${{ github.sha }}" \
--set environment="production" \
--set deployment.version="$TARGET_ENV" \
--set ingress.host="myapp.com" \
--wait --timeout=15m
- name: Validate new deployment
run: |
TARGET_ENV=$(kubectl get deployment -l version!=active -o jsonpath='{.items[0].metadata.labels.version}')
# Wait for deployment to be fully ready
kubectl rollout status deployment/myapp-$TARGET_ENV -n production --timeout=600s
# Run comprehensive health checks
kubectl port-forward svc/myapp-$TARGET_ENV 8080:80 -n production &
PF_PID=$!
sleep 10
# Smoke tests (install test dependencies first; this job starts from a clean checkout)
curl -f http://localhost:8080/health || exit 1
npm ci --prefer-offline --no-audit
npm run test:smoke -- --baseURL=http://localhost:8080
kill $PF_PID
- name: Switch traffic to new deployment
run: |
TARGET_ENV=$(kubectl get deployment -l version!=active -o jsonpath='{.items[0].metadata.labels.version}')
echo "Switching traffic to $TARGET_ENV"
kubectl patch service myapp-active -p '{"spec":{"selector":{"version":"'$TARGET_ENV'"}}}'
# Update labels to mark new environment as active
kubectl label deployment myapp-$TARGET_ENV version=active --overwrite
# Mark old environment as inactive
OLD_ENV=$([ "$TARGET_ENV" = "blue" ] && echo "green" || echo "blue")
kubectl label deployment myapp-$OLD_ENV version=inactive --overwrite
- name: Post-deployment monitoring
run: |
echo "Monitoring deployment for 5 minutes..."
for i in {1..30}; do
if ! curl -f https://myapp.com/health; then
echo "Health check failed, initiating rollback"
# Rollback logic would go here
exit 1
fi
sleep 10
done
echo "Deployment stable, monitoring successful"
- name: Cleanup old deployment
run: |
OLD_ENV=$(kubectl get deployment -l version=inactive -o jsonpath='{.items[0].metadata.labels.version}')
if [ -n "$OLD_ENV" ]; then
echo "Cleaning up old deployment: $OLD_ENV"
helm uninstall myapp-$OLD_ENV --namespace production || true
fi
# ========================================
# Post-deployment notifications
# ========================================
notify-deployment:
name: Notify Deployment Status
runs-on: ubuntu-latest
needs: [deploy-production]
if: always()
steps:
- name: Notify Slack on success
if: needs.deploy-production.result == 'success'
uses: 8398a7/action-slack@v3
with:
status: success
channel: "#deployments"
text: |
🎉 Production deployment successful!
**Repository:** ${{ github.repository }}
**Branch:** ${{ github.ref_name }}
**Commit:** ${{ github.sha }}
**Author:** ${{ github.actor }}
**Deployed to:** https://myapp.com
**Dashboard:** https://grafana.myapp.com
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
- name: Notify Slack on failure
if: needs.deploy-production.result == 'failure'
uses: 8398a7/action-slack@v3
with:
status: failure
channel: "#alerts"
text: |
🚨 Production deployment failed!
**Repository:** ${{ github.repository }}
**Branch:** ${{ github.ref_name }}
**Commit:** ${{ github.sha }}
**Author:** ${{ github.actor }}
**Action:** https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}
Please investigate immediately.
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
- name: Create GitHub release on successful production deploy
if: needs.deploy-production.result == 'success' && github.ref == 'refs/heads/main'
uses: actions/create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: production-${{ github.run_number }}
release_name: Production Release ${{ github.run_number }}
body: |
Automated production release
**Commit:** ${{ github.sha }}
**Deployed:** $(date -u +"%Y-%m-%d %H:%M:%S UTC")
[View deployment workflow](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})
draft: false
prerelease: false
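Every smoke test, readiness probe, and rollout check in the workflow above polls a /health endpoint, so the application has to expose one. Below is a minimal sketch using Express and a PostgreSQL connectivity check; the file name, the dependency being checked, and the response shape are assumptions rather than anything defined in the workflow itself:
// health.js - hypothetical /health endpoint polled by the smoke tests above
const express = require("express");
const { Pool } = require("pg");

const app = express();
const db = new Pool({ connectionString: process.env.DATABASE_URL });

app.get("/health", async (req, res) => {
  try {
    // Check a critical dependency so "healthy" means "able to serve traffic"
    await db.query("SELECT 1");
    res.status(200).json({ status: "ok", uptime: process.uptime() });
  } catch (err) {
    // A non-200 response makes curl -f and the rollout checks fail fast
    res.status(503).json({ status: "unhealthy", error: err.message });
  }
});

app.listen(process.env.PORT || 3000);
Returning a non-200 status here is exactly what lets curl -f and kubectl rollout status stop a broken build before traffic is switched to it.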
GitLab CI/CD Pipeline Implementation:
# .gitlab-ci.yml - Professional GitLab CI/CD pipeline
stages:
- quality
- test
- security
- build
- deploy-staging
- deploy-production
- monitor
variables:
DOCKER_REGISTRY: $CI_REGISTRY
IMAGE_NAME: $CI_REGISTRY_IMAGE
KUBERNETES_VERSION: "1.28.4" # full patch version required by the kubectl download URL
# ========================================
# Quality and Security Analysis
# ========================================
code-quality:
stage: quality
image: node:18-alpine
cache:
paths:
- node_modules/
script:
- npm ci --prefer-offline
- npm run lint -- --format=junit --output-file=eslint-report.xml
- npm run prettier:check
- npx tsc --noEmit
artifacts:
reports:
junit: eslint-report.xml
paths:
- eslint-report.xml
expire_in: 1 week
dependency-security:
stage: quality
image: node:18-alpine
script:
- npm ci --prefer-offline
- npm audit --audit-level=high
- npx audit-ci --config .auditrc.json
- npx license-checker --onlyAllow "MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC"
allow_failure: false
sonarcloud-check:
stage: quality
image: sonarsource/sonar-scanner-cli:latest
variables:
SONAR_USER_HOME: "${CI_PROJECT_DIR}/.sonar"
GIT_DEPTH: "0"
cache:
key: "${CI_JOB_NAME}"
paths:
- .sonar/cache
script:
- sonar-scanner
only:
- main
- merge_requests
# ========================================
# Comprehensive Testing
# ========================================
unit-tests:
stage: test
image: node:18-alpine
services:
- postgres:15
- redis:7-alpine
variables:
POSTGRES_DB: testdb
POSTGRES_USER: testuser
POSTGRES_PASSWORD: testpass
REDIS_URL: redis://redis:6379
DATABASE_URL: postgresql://testuser:testpass@postgres:5432/testdb
cache:
paths:
- node_modules/
before_script:
- npm ci --prefer-offline
- npm run db:migrate
script:
- npm run test:unit -- --coverage --ci --watchAll=false
- npm run test:integration
coverage: '/All files[^|]*\|[^|]*\s+([\d\.]+)/'
artifacts:
reports:
coverage_report:
coverage_format: cobertura
path: coverage/cobertura-coverage.xml
junit: junit.xml
paths:
- coverage/
expire_in: 1 week
e2e-tests:
stage: test
image: mcr.microsoft.com/playwright:v1.40.0-focal
services:
- postgres:15
- redis:7-alpine
variables:
POSTGRES_DB: testdb
POSTGRES_USER: testuser
POSTGRES_PASSWORD: testpass
DATABASE_URL: postgresql://testuser:testpass@postgres:5432/testdb
REDIS_URL: redis://redis:6379
before_script:
- npm ci --prefer-offline
- npm run build
- npm run db:migrate
- npm start &
- sleep 30
- curl -f http://localhost:3000/health
script:
- npx playwright test
artifacts:
when: always
paths:
- playwright-report/
- test-results/
expire_in: 1 week
# ========================================
# Security and Container Scanning
# ========================================
container-security:
stage: security
image: docker:20.10.16
services:
- docker:20.10.16-dind
variables:
DOCKER_TLS_CERTDIR: "/certs"
TRIVY_CACHE_DIR: ".trivycache/"
before_script:
- docker build -t $IMAGE_NAME:$CI_COMMIT_SHA .
- apk add --no-cache curl
- curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
script:
- trivy image --exit-code 1 --severity HIGH,CRITICAL --no-progress $IMAGE_NAME:$CI_COMMIT_SHA
- trivy image --format template --template "@contrib/gitlab.tpl" --output gl-container-scanning-report.json $IMAGE_NAME:$CI_COMMIT_SHA
cache:
paths:
- .trivycache/
artifacts:
reports:
container_scanning: gl-container-scanning-report.json
expire_in: 1 week
dockerfile-lint:
stage: security
image: hadolint/hadolint:latest-alpine
script:
- hadolint --format gitlab_codeclimate --failure-threshold warning Dockerfile > hadolint-report.json
artifacts:
reports:
codequality: hadolint-report.json
expire_in: 1 week
# ========================================
# Build and Registry
# ========================================
build-image:
stage: build
image: docker:20.10.16
services:
- docker:20.10.16-dind
variables:
DOCKER_TLS_CERTDIR: "/certs"
before_script:
- docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
script:
- |
# Multi-arch build with buildx
docker buildx create --use
docker buildx build \
--platform linux/amd64,linux/arm64 \
--tag $IMAGE_NAME:$CI_COMMIT_SHA \
--tag $IMAGE_NAME:latest \
--push \
--build-arg BUILDTIME=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
--build-arg VERSION=$CI_COMMIT_TAG \
--build-arg REVISION=$CI_COMMIT_SHA \
.
only:
- main
- tags
# ========================================
# Staging Deployment
# ========================================
deploy-staging:
stage: deploy-staging
image:
name: alpine/helm:3.12.0
entrypoint: [""]
environment:
name: staging
url: https://staging.myapp.com
before_script:
- apk add --no-cache curl bash
- curl -LO "https://dl.k8s.io/release/v$KUBERNETES_VERSION/bin/linux/amd64/kubectl"
- install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
- mkdir -p ~/.kube
- echo $KUBE_CONFIG_STAGING | base64 -d > ~/.kube/config
script:
- |
helm upgrade --install myapp-staging ./helm/myapp \
--namespace staging \
--create-namespace \
--set image.repository=$IMAGE_NAME \
--set image.tag=$CI_COMMIT_SHA \
--set environment=staging \
--set ingress.host=staging.myapp.com \
--wait --timeout=10m
kubectl rollout status deployment/myapp-staging -n staging --timeout=600s
# Smoke tests
sleep 30
curl -f https://staging.myapp.com/health
only:
- main
staging-smoke-tests:
stage: deploy-staging
image: node:18-alpine
dependencies:
- deploy-staging
script:
- npm ci --prefer-offline
- npm run test:smoke -- --baseURL=https://staging.myapp.com
- npm run test:load -- --baseURL=https://staging.myapp.com --duration=5m
only:
- main
# ========================================
# Production Deployment
# ========================================
deploy-production:
stage: deploy-production
image:
name: alpine/helm:3.12.0
entrypoint: [""]
environment:
name: production
url: https://myapp.com
before_script:
- apk add --no-cache curl bash jq
- curl -LO "https://dl.k8s.io/release/v$KUBERNETES_VERSION/bin/linux/amd64/kubectl"
- install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
- mkdir -p ~/.kube
- echo $KUBE_CONFIG_PRODUCTION | base64 -d > ~/.kube/config
script:
- |
# Blue-green deployment strategy
ACTIVE_ENV=$(kubectl get service myapp-active -o jsonpath='{.spec.selector.version}' 2>/dev/null || echo "blue")
TARGET_ENV=$([ "$ACTIVE_ENV" = "blue" ] && echo "green" || echo "blue")
echo "Deploying to $TARGET_ENV (current active: $ACTIVE_ENV)"
# Deploy to target environment
helm upgrade --install myapp-$TARGET_ENV ./helm/myapp \
--namespace production \
--create-namespace \
--set image.repository=$IMAGE_NAME \
--set image.tag=$CI_COMMIT_SHA \
--set environment=production \
--set deployment.version=$TARGET_ENV \
--set ingress.host=myapp.com \
--wait --timeout=15m
kubectl rollout status deployment/myapp-$TARGET_ENV -n production --timeout=600s
# Validation tests
kubectl port-forward svc/myapp-$TARGET_ENV 8080:80 -n production &
PF_PID=$!
sleep 15
curl -f http://localhost:8080/health || exit 1
kill $PF_PID
# Switch traffic
kubectl patch service myapp-active -p '{"spec":{"selector":{"version":"'$TARGET_ENV'"}}}'
kubectl label deployment myapp-$TARGET_ENV version=active --overwrite
# Monitor for 5 minutes
for i in {1..30}; do
curl -f https://myapp.com/health || exit 1
sleep 10
done
# Cleanup old deployment
OLD_ENV=$([ "$TARGET_ENV" = "blue" ] && echo "green" || echo "blue")
helm uninstall myapp-$OLD_ENV --namespace production || true
when: manual
only:
- main
# ========================================
# Monitoring and Notifications
# ========================================
post-deploy-monitoring:
stage: monitor
image: curlimages/curl:latest
script:
- |
# Send deployment notification
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"✅ Production deployment successful!\n**Project:** $CI_PROJECT_NAME\n**Commit:** $CI_COMMIT_SHA\n**Pipeline:** $CI_PIPELINE_URL\"}" \
$SLACK_WEBHOOK_URL
# Trigger monitoring dashboard update
curl -X POST -H "Authorization: Bearer $GRAFANA_API_KEY" \
-H "Content-Type: application/json" \
-d '{"dashboard": {"id": null, "title": "Deployment Tracking", "tags": ["deployment"], "timezone": "browser"}}' \
$GRAFANA_URL/api/dashboards/db
dependencies:
- deploy-production
when: on_success
only:
- main
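Both pipelines shell out to npm run test:smoke with a --baseURL flag, but the script behind that command never appears above. A minimal sketch is shown below, assuming Node 18's built-in fetch; the file name, the endpoints checked, and the package.json wiring are assumptions:
// scripts/smoke-test.js - hypothetical smoke test invoked as
//   npm run test:smoke -- --baseURL=https://staging.myapp.com
const arg = process.argv.find((a) => a.startsWith("--baseURL="));
const baseURL = arg ? arg.slice("--baseURL=".length) : "http://localhost:3000";

// Cheap, read-only checks: is the deployment up and answering on its critical paths?
const checks = [
  { path: "/health", expect: 200 },
  { path: "/", expect: 200 },
];

(async () => {
  try {
    for (const { path, expect } of checks) {
      const started = Date.now();
      const res = await fetch(`${baseURL}${path}`);
      const ms = Date.now() - started;
      if (res.status !== expect) {
        console.error(`FAIL ${path}: got ${res.status}, expected ${expect}`);
        process.exit(1); // non-zero exit fails the pipeline stage
      }
      console.log(`OK ${path} (${res.status}, ${ms} ms)`);
    }
  } catch (err) {
    console.error(`FAIL: ${err.message}`);
    process.exit(1);
  }
})();
Wired up as "test:smoke": "node scripts/smoke-test.js" in package.json, any failing check returns a non-zero exit code and stops the deployment right there.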
Advanced Infrastructure as Code: Modules, State, and Collaboration
Beyond Basic Terraform: Enterprise Infrastructure Management
Understanding advanced Infrastructure as Code patterns:
# terraform/modules/networking/main.tf - Reusable networking module
variable "project_name" {
description = "Name of the project"
type = string
}
variable "environment" {
description = "Environment (dev, staging, production)"
type = string
}
variable "region" {
description = "AWS region"
type = string
}
variable "availability_zones" {
description = "List of availability zones"
type = list(string)
}
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
default = "10.0.0.0/16"
}
variable "enable_nat_gateway" {
description = "Enable NAT Gateway for private subnets"
type = bool
default = true
}
variable "enable_vpn_gateway" {
description = "Enable VPN Gateway"
type = bool
default = false
}
locals {
common_tags = {
Project = var.project_name
Environment = var.environment
Module = "networking"
ManagedBy = "terraform"
}
# Calculate subnet CIDRs automatically
public_subnet_cidrs = [for i in range(length(var.availability_zones)) : cidrsubnet(var.vpc_cidr, 8, i)]
private_subnet_cidrs = [for i in range(length(var.availability_zones)) : cidrsubnet(var.vpc_cidr, 8, i + 10)]
database_subnet_cidrs = [for i in range(length(var.availability_zones)) : cidrsubnet(var.vpc_cidr, 8, i + 20)]
}
# VPC
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-vpc"
})
}
# Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-igw"
})
}
# Public Subnets
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = local.public_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-public-${count.index + 1}"
Type = "public"
"kubernetes.io/role/elb" = "1"
})
}
# Private Subnets
resource "aws_subnet" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = local.private_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-private-${count.index + 1}"
Type = "private"
"kubernetes.io/role/internal-elb" = "1"
})
}
# Database Subnets
resource "aws_subnet" "database" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = local.database_subnet_cidrs[count.index]
availability_zone = var.availability_zones[count.index]
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-database-${count.index + 1}"
Type = "database"
})
}
# NAT Gateways (conditional)
resource "aws_eip" "nat" {
count = var.enable_nat_gateway ? length(var.availability_zones) : 0
domain = "vpc"
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-nat-eip-${count.index + 1}"
})
}
resource "aws_nat_gateway" "main" {
count = var.enable_nat_gateway ? length(var.availability_zones) : 0
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-nat-${count.index + 1}"
})
depends_on = [aws_internet_gateway.main]
}
# Route Tables
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-public-rt"
})
}
resource "aws_route_table" "private" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
dynamic "route" {
for_each = var.enable_nat_gateway ? [1] : []
content {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main[count.index].id
}
}
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-private-rt-${count.index + 1}"
})
}
resource "aws_route_table" "database" {
vpc_id = aws_vpc.main.id
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-database-rt"
})
}
# Route Table Associations
resource "aws_route_table_association" "public" {
count = length(var.availability_zones)
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "private" {
count = length(var.availability_zones)
subnet_id = aws_subnet.private[count.index].id
route_table_id = aws_route_table.private[count.index].id
}
resource "aws_route_table_association" "database" {
count = length(var.availability_zones)
subnet_id = aws_subnet.database[count.index].id
route_table_id = aws_route_table.database.id
}
# Database Subnet Group
resource "aws_db_subnet_group" "main" {
name = "${var.project_name}-${var.environment}-db-subnet-group"
subnet_ids = aws_subnet.database[*].id
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-db-subnet-group"
})
}
# VPC Flow Logs
resource "aws_flow_log" "main" {
iam_role_arn = aws_iam_role.flow_log.arn
log_destination = aws_cloudwatch_log_group.vpc_flow_log.arn
traffic_type = "ALL"
vpc_id = aws_vpc.main.id
}
resource "aws_cloudwatch_log_group" "vpc_flow_log" {
name = "/aws/vpc/flowlogs/${var.project_name}-${var.environment}"
retention_in_days = 30
tags = local.common_tags
}
resource "aws_iam_role" "flow_log" {
name = "${var.project_name}-${var.environment}-flow-log-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "vpc-flow-logs.amazonaws.com"
}
}
]
})
tags = local.common_tags
}
resource "aws_iam_role_policy" "flow_log" {
name = "${var.project_name}-${var.environment}-flow-log-policy"
role = aws_iam_role.flow_log.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
]
Resource = "*"
}
]
})
}
# Outputs
output "vpc_id" {
description = "ID of the VPC"
value = aws_vpc.main.id
}
output "vpc_cidr_block" {
description = "CIDR block of the VPC"
value = aws_vpc.main.cidr_block
}
output "public_subnet_ids" {
description = "IDs of the public subnets"
value = aws_subnet.public[*].id
}
output "private_subnet_ids" {
description = "IDs of the private subnets"
value = aws_subnet.private[*].id
}
output "database_subnet_ids" {
description = "IDs of the database subnets"
value = aws_subnet.database[*].id
}
output "database_subnet_group_name" {
description = "Name of the database subnet group"
value = aws_db_subnet_group.main.name
}
output "internet_gateway_id" {
description = "ID of the Internet Gateway"
value = aws_internet_gateway.main.id
}
output "nat_gateway_ids" {
description = "IDs of the NAT Gateways"
value = aws_nat_gateway.main[*].id
}
Advanced Terraform State Management:
# terraform/environments/production/backend.tf - Remote state configuration
terraform {
backend "s3" {
bucket = "myapp-terraform-state-production"
key = "infrastructure/production/terraform.tfstate"
region = "us-west-2"
encrypt = true
dynamodb_table = "terraform-locks-production"
# State locking and consistency checking
skip_credentials_validation = false
skip_metadata_api_check = false
skip_region_validation = false
force_path_style = false
}
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
version = "~> 2.23"
}
helm = {
source = "hashicorp/helm"
version = "~> 2.11"
}
}
}
# terraform/environments/production/main.tf - Environment-specific configuration
locals {
environment = "production"
region = "us-west-2"
# Environment-specific configurations
cluster_config = {
version = "1.28"
node_groups = {
main = {
instance_types = ["m5.large", "m5.xlarge"]
capacity_type = "ON_DEMAND"
min_size = 3
max_size = 20
desired_size = 5
}
spot = {
instance_types = ["m5.large", "m5.xlarge", "m5a.large", "m5a.xlarge"]
capacity_type = "SPOT"
min_size = 2
max_size = 15
desired_size = 3
}
}
}
database_config = {
engine_version = "15.4"
instance_class = "db.r5.xlarge"
multi_az = true
backup_retention = 30
storage_size = 500
max_storage = 2000
}
redis_config = {
node_type = "cache.r6g.large"
num_cache_clusters = 3
automatic_failover = true
multi_az = true
}
}
data "aws_availability_zones" "available" {
state = "available"
}
# Networking Module
module "networking" {
source = "../../modules/networking"
project_name = var.project_name
environment = local.environment
region = local.region
availability_zones = slice(data.aws_availability_zones.available.names, 0, 3)
vpc_cidr = "10.0.0.0/16"
enable_nat_gateway = true
enable_vpn_gateway = false
}
# Security Module
module "security" {
source = "../../modules/security"
project_name = var.project_name
environment = local.environment
vpc_id = module.networking.vpc_id
vpc_cidr = module.networking.vpc_cidr_block
}
# EKS Cluster Module
module "eks" {
source = "../../modules/eks"
project_name = var.project_name
environment = local.environment
region = local.region
kubernetes_version = local.cluster_config.version
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
security_group_ids = [module.security.cluster_security_group_id]
node_groups = local.cluster_config.node_groups
# Add-ons
enable_cluster_autoscaler = true
enable_aws_load_balancer_controller = true
enable_external_dns = true
enable_cert_manager = true
tags = {
Environment = local.environment
Terraform = "true"
}
}
# Database Module
module "database" {
source = "../../modules/database"
project_name = var.project_name
environment = local.environment
vpc_id = module.networking.vpc_id
subnet_group_name = module.networking.database_subnet_group_name
security_group_ids = [module.security.database_security_group_id]
engine_version = local.database_config.engine_version
instance_class = local.database_config.instance_class
allocated_storage = local.database_config.storage_size
max_allocated_storage = local.database_config.max_storage
multi_az = local.database_config.multi_az
backup_retention_period = local.database_config.backup_retention
# Performance Insights
performance_insights_enabled = true
performance_insights_retention_period = 7
# Enhanced Monitoring
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_enhanced_monitoring.arn
tags = {
Environment = local.environment
Terraform = "true"
}
}
# Enhanced Monitoring IAM Role for RDS
resource "aws_iam_role" "rds_enhanced_monitoring" {
name = "${var.project_name}-${local.environment}-rds-monitoring-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "monitoring.rds.amazonaws.com"
}
}
]
})
}
resource "aws_iam_role_policy_attachment" "rds_enhanced_monitoring" {
role = aws_iam_role.rds_enhanced_monitoring.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonRDSEnhancedMonitoringRole"
}
# Redis Module
module "redis" {
source = "../../modules/redis"
project_name = var.project_name
environment = local.environment
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
security_group_ids = [module.security.redis_security_group_id]
node_type = local.redis_config.node_type
num_cache_clusters = local.redis_config.num_cache_clusters
automatic_failover_enabled = local.redis_config.automatic_failover
multi_az_enabled = local.redis_config.multi_az
# Security
at_rest_encryption_enabled = true
transit_encryption_enabled = true
auth_token_enabled = true
# Backup
snapshot_retention_limit = 7
snapshot_window = "03:00-05:00"
tags = {
Environment = local.environment
Terraform = "true"
}
}
# Monitoring Module
module "monitoring" {
source = "../../modules/monitoring"
project_name = var.project_name
environment = local.environment
cluster_name = module.eks.cluster_name
vpc_id = module.networking.vpc_id
# SNS Topics for alerts
create_sns_topics = true
alert_email = var.alert_email
# CloudWatch Dashboard
create_dashboard = true
# Log retention
log_retention_days = 30
tags = {
Environment = local.environment
Terraform = "true"
}
}
# Outputs
output "cluster_endpoint" {
description = "EKS cluster endpoint"
value = module.eks.cluster_endpoint
sensitive = true
}
output "cluster_name" {
description = "EKS cluster name"
value = module.eks.cluster_name
}
output "database_endpoint" {
description = "RDS database endpoint"
value = module.database.endpoint
sensitive = true
}
output "redis_endpoint" {
description = "Redis cluster endpoint"
value = module.redis.configuration_endpoint
sensitive = true
}
output "vpc_id" {
description = "VPC ID"
value = module.networking.vpc_id
}
Terraform Workspace Management:
#!/bin/bash
# terraform-management.sh - Professional Terraform workflow automation
set -euo pipefail
PROJECT_NAME="${PROJECT_NAME:-myapp}"
TERRAFORM_VERSION="${TERRAFORM_VERSION:-1.5.7}"
AWS_REGION="${AWS_REGION:-us-west-2}"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
log() {
echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] $1${NC}"
}
warn() {
echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING: $1${NC}"
}
error() {
echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $1${NC}"
exit 1
}
check_prerequisites() {
log "Checking prerequisites..."
# Check Terraform installation
if ! command -v terraform &> /dev/null; then
error "Terraform is not installed"
fi
local tf_version=$(terraform version -json | jq -r '.terraform_version')
if [[ "$tf_version" != "$TERRAFORM_VERSION" ]]; then
warn "Expected Terraform $TERRAFORM_VERSION, found $tf_version"
fi
# Check AWS CLI
if ! command -v aws &> /dev/null; then
error "AWS CLI is not installed"
fi
# Verify AWS credentials
if ! aws sts get-caller-identity &> /dev/null; then
error "AWS credentials not configured or invalid"
fi
log "Prerequisites check passed"
}
setup_terraform_backend() {
local environment="$1"
log "Setting up Terraform backend for $environment..."
# Create S3 bucket for state if it doesn't exist
local bucket_name="${PROJECT_NAME}-terraform-state-${environment}"
if ! aws s3 ls "s3://$bucket_name" &> /dev/null; then
log "Creating S3 bucket: $bucket_name"
aws s3 mb "s3://$bucket_name" --region "$AWS_REGION"
# Enable versioning
aws s3api put-bucket-versioning \
--bucket "$bucket_name" \
--versioning-configuration Status=Enabled
# Enable server-side encryption
aws s3api put-bucket-encryption \
--bucket "$bucket_name" \
--server-side-encryption-configuration '{
"Rules": [{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "AES256"
}
}]
}'
# Block public access
aws s3api put-public-access-block \
--bucket "$bucket_name" \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
fi
# Create DynamoDB table for locking if it doesn't exist
local table_name="terraform-locks-${environment}"
if ! aws dynamodb describe-table --table-name "$table_name" &> /dev/null; then
log "Creating DynamoDB table: $table_name"
aws dynamodb create-table \
--table-name "$table_name" \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5 \
--region "$AWS_REGION"
# Wait for table to be active
aws dynamodb wait table-exists --table-name "$table_name"
fi
log "Terraform backend setup completed for $environment"
}
terraform_plan() {
local environment="$1"
local workspace_dir="terraform/environments/$environment"
if [[ ! -d "$workspace_dir" ]]; then
error "Environment directory not found: $workspace_dir"
fi
cd "$workspace_dir"
log "Planning Terraform changes for $environment..."
# Initialize if needed
terraform init -upgrade
# Validate configuration
terraform validate
# Format check
if ! terraform fmt -check -recursive; then
warn "Terraform files are not properly formatted. Run 'terraform fmt -recursive' to fix."
fi
# Security scan with tfsec
if command -v tfsec &> /dev/null; then
log "Running security scan with tfsec..."
tfsec . --format=junit --out=tfsec-report.xml || warn "Security issues found"
fi
# Cost estimation with Infracost
if command -v infracost &> /dev/null && [[ -n "${INFRACOST_API_KEY:-}" ]]; then
log "Generating cost estimate..."
infracost breakdown --path . --format json --out-file infracost.json
infracost diff --path . --format table
fi
# Plan with detailed output (-detailed-exitcode returns 2 when changes exist,
# so capture the exit code explicitly instead of letting 'set -e' abort the script)
local plan_exit_code=0
terraform plan \
-detailed-exitcode \
-out="tfplan-$(date +%Y%m%d-%H%M%S).plan" \
-var="project_name=$PROJECT_NAME" \
-var-file="terraform.tfvars" || plan_exit_code=$?
if [[ $plan_exit_code -eq 1 ]]; then
error "Terraform plan failed"
elif [[ $plan_exit_code -eq 2 ]]; then
log "Terraform plan completed with changes"
else
log "Terraform plan completed - no changes"
fi
cd - > /dev/null
}
terraform_apply() {
local environment="$1"
local workspace_dir="terraform/environments/$environment"
local auto_approve="${2:-false}"
if [[ ! -d "$workspace_dir" ]]; then
error "Environment directory not found: $workspace_dir"
fi
cd "$workspace_dir"
log "Applying Terraform changes for $environment..."
# Find the latest plan file
local plan_file=$(ls -t tfplan-*.plan 2>/dev/null | head -n1 || echo "")
if [[ -z "$plan_file" ]]; then
warn "No plan file found. Running plan first..."
terraform plan -var="project_name=$PROJECT_NAME" -var-file="terraform.tfvars"
fi
# Apply changes
local apply_args=(-var="project_name=$PROJECT_NAME" -var-file="terraform.tfvars")
if [[ "$auto_approve" == "true" ]]; then
apply_args+=(-auto-approve)
fi
if [[ -n "$plan_file" ]]; then
terraform apply "${apply_args[@]}" "$plan_file"
else
terraform apply "${apply_args[@]}"
fi
# Clean up old plan files
find . -name "tfplan-*.plan" -mtime +7 -delete
log "Terraform apply completed for $environment"
# Output important values
log "Retrieving outputs..."
terraform output -json > "outputs-$(date +%Y%m%d-%H%M%S).json"
cd - > /dev/null
}
terraform_destroy() {
local environment="$1"
local workspace_dir="terraform/environments/$environment"
if [[ ! -d "$workspace_dir" ]]; then
error "Environment directory not found: $workspace_dir"
fi
if [[ "$environment" == "production" ]]; then
error "Destruction of production environment requires manual confirmation"
fi
cd "$workspace_dir"
warn "This will DESTROY all resources in $environment environment!"
read -p "Are you absolutely sure? Type 'yes' to confirm: " confirmation
if [[ "$confirmation" != "yes" ]]; then
log "Destruction cancelled"
cd - > /dev/null
return
fi
log "Destroying Terraform resources for $environment..."
terraform destroy \
-var="project_name=$PROJECT_NAME" \
-var-file="terraform.tfvars"
log "Terraform destroy completed for $environment"
cd - > /dev/null
}
state_management() {
local action="$1"
local environment="$2"
local workspace_dir="terraform/environments/$environment"
if [[ ! -d "$workspace_dir" ]]; then
error "Environment directory not found: $workspace_dir"
fi
cd "$workspace_dir"
case "$action" in
"list")
log "Listing Terraform state resources for $environment..."
terraform state list
;;
"show")
local resource="$3"
if [[ -z "$resource" ]]; then
error "Resource name required for show command"
fi
terraform state show "$resource"
;;
"pull")
log "Pulling remote state for $environment..."
terraform state pull > "state-backup-$(date +%Y%m%d-%H%M%S).json"
log "State backed up locally"
;;
"refresh")
log "Refreshing Terraform state for $environment..."
terraform refresh -var="project_name=$PROJECT_NAME" -var-file="terraform.tfvars"
;;
*)
error "Unknown state action: $action"
;;
esac
cd - > /dev/null
}
# Main command router
case "${1:-help}" in
"init")
check_prerequisites
setup_terraform_backend "${2:-staging}"
;;
"plan")
check_prerequisites
terraform_plan "${2:-staging}"
;;
"apply")
check_prerequisites
terraform_apply "${2:-staging}" "${3:-false}"
;;
"destroy")
check_prerequisites
terraform_destroy "${2:-staging}"
;;
"state")
check_prerequisites
state_management "${2:-list}" "${3:-staging}" "${4:-}"
;;
"help"|*)
cat << EOF
Terraform Infrastructure Management
Usage: $0 <command> [options]
Commands:
init <environment> Initialize Terraform backend
plan <environment> Plan infrastructure changes
apply <environment> [auto] Apply infrastructure changes
destroy <environment> Destroy infrastructure (staging only)
state <action> <environment> Manage Terraform state
State Actions:
list List all resources
show <resource> Show resource details
pull Backup state locally
refresh Refresh state from real infrastructure
Environments:
staging Staging environment
production Production environment
Examples:
$0 init staging
$0 plan production
$0 apply staging auto
$0 state list production
$0 state show production aws_vpc.main
EOF
;;
esac
Monitoring and Alerting: Proactive Operations Excellence
From Reactive Firefighting to Proactive Problem Prevention
Understanding the monitoring maturity model:
// Monitoring evolution: Reactive to Predictive Operations
const monitoringMaturity = {
level1_Reactive: {
approach: "Monitor basic uptime, react to customer complaints",
characteristics: [
"Ping monitoring for basic availability",
"Manual log checking when problems occur",
"Alerts come from angry customers on social media",
"Debugging happens during outages",
],
problems: [
"Issues discovered hours after they occur",
"No visibility into performance degradation",
"Root cause analysis takes days",
"Customer experience suffers",
],
reality: "You're always one step behind problems",
},
level2_Proactive: {
approach: "Monitor key metrics, alert before customer impact",
characteristics: [
"Application performance monitoring (APM)",
"Infrastructure metrics and thresholds",
"Automated alerts for critical issues",
"Basic dashboards and visualization",
],
benefits: [
"Issues detected before customer complaints",
"Faster mean time to resolution (MTTR)",
"Better understanding of system behavior",
"Reduced firefighting stress",
],
limitations: "Still reactive to symptoms, not causes",
},
level3_Predictive: {
approach: "Predict problems, prevent outages, optimize performance",
characteristics: [
"Machine learning-based anomaly detection",
"Predictive alerting based on trends",
"Automatic remediation for known issues",
"Comprehensive observability platform",
],
advantages: [
"Problems prevented before they occur",
"Automatic scaling and optimization",
"Data-driven capacity planning",
"Continuous performance improvement",
],
outcome: "Operations become invisible to customers",
},
};
Comprehensive Monitoring Stack with Prometheus:
# monitoring/prometheus/prometheus.yml - Production Prometheus configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: "production"
region: "us-west-2"
environment: "production"
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
path_prefix: /alertmanager
scheme: http
# Rules for alerts and recording rules
rule_files:
- "/etc/prometheus/rules/*.yml"
- "/etc/prometheus/alerts/*.yml"
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
scrape_interval: 5s
metrics_path: /metrics
# Node Exporter for system metrics
- job_name: "node-exporter"
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- monitoring
relabel_configs:
- source_labels: [__meta_kubernetes_endpoints_name]
action: keep
regex: node-exporter
- source_labels: [__meta_kubernetes_endpoint_address_target_name]
target_label: node
- source_labels: [__meta_kubernetes_pod_node_name]
target_label: instance
# Kubernetes API Server
- job_name: "kubernetes-apiservers"
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- default
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels:
[
__meta_kubernetes_namespace,
__meta_kubernetes_service_name,
__meta_kubernetes_endpoint_port_name,
]
action: keep
regex: default;kubernetes;https
# Kubernetes nodes (kubelet)
- job_name: "kubernetes-nodes"
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Application pods
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels:
[__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Database monitoring (PostgreSQL)
- job_name: "postgres-exporter"
static_configs:
- targets: ["postgres-exporter:9187"]
scrape_interval: 30s
# Redis monitoring
- job_name: "redis-exporter"
static_configs:
- targets: ["redis-exporter:9121"]
scrape_interval: 30s
# AWS CloudWatch metrics
- job_name: "cloudwatch-exporter"
static_configs:
- targets: ["cloudwatch-exporter:9106"]
scrape_interval: 60s
# Application-specific metrics
- job_name: "myapp"
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- production
relabel_configs:
- source_labels:
[__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels:
[__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
# Remote write configuration for long-term storage
remote_write:
- url: "https://prometheus-prod.monitoring.myapp.com/api/v1/write"
queue_config:
max_samples_per_send: 1000
max_shards: 200
capacity: 2500
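The kubernetes-pods and myapp scrape jobs only discover targets that actually expose Prometheus metrics, and the alert rules in the next section query http_requests_total and http_request_duration_seconds. A minimal instrumentation sketch with the prom-client library is shown below; the middleware, label names, and bucket boundaries are assumptions chosen to line up with those expressions:
// metrics.js - hypothetical prom-client instrumentation for the Node.js service
const express = require("express");
const client = require("prom-client");

const app = express();
const register = new client.Registry();
client.collectDefaultMetrics({ register });

const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route", "status"],
  registers: [register],
});

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
  registers: [register],
});

// One counter increment and one duration observation per finished request
app.use((req, res, next) => {
  const stopTimer = httpRequestDuration.startTimer();
  res.on("finish", () => {
    const labels = {
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status: res.statusCode,
    };
    httpRequestsTotal.inc(labels);
    stopTimer(labels);
  });
  next();
});

// Endpoint matched by the prometheus.io/path annotation (defaults to /metrics)
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

app.listen(process.env.PORT || 3000);
With this in place, a pod or service only needs the prometheus.io/scrape: "true" and prometheus.io/port annotations for the relabel rules above to start scraping it.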
Advanced Alerting Rules:
# monitoring/prometheus/alerts/application-alerts.yml - Comprehensive alerting rules
groups:
- name: application.rules
rules:
# ========================================
# Application Availability Alerts
# ========================================
- alert: ApplicationDown
expr: up{job="myapp"} == 0
for: 30s
labels:
severity: critical
team: platform
service: "{{ $labels.kubernetes_name }}"
annotations:
summary: "Application {{ $labels.instance }} is down"
description: |
Application {{ $labels.kubernetes_name }} in namespace {{ $labels.kubernetes_namespace }}
has been down for more than 30 seconds.
Instance: {{ $labels.instance }}
Job: {{ $labels.job }}
runbook_url: "https://runbooks.myapp.com/application-down"
dashboard_url: "https://grafana.myapp.com/d/app-overview"
- alert: ApplicationHighErrorRate
expr: |
(
rate(http_requests_total{job="myapp", status=~"5.."}[5m]) /
rate(http_requests_total{job="myapp"}[5m])
) > 0.05
for: 2m
labels:
severity: warning
team: platform
service: "{{ $labels.kubernetes_name }}"
annotations:
summary: "High error rate detected for {{ $labels.service }}"
description: |
Application {{ $labels.service }} is experiencing {{ $value | humanizePercentage }} error rate
for more than 2 minutes.
Current error rate: {{ $value | humanizePercentage }}
Threshold: 5%
runbook_url: "https://runbooks.myapp.com/high-error-rate"
- alert: ApplicationHighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket{job="myapp"}[5m])
) > 0.5
for: 2m
labels:
severity: warning
team: platform
service: "{{ $labels.kubernetes_name }}"
annotations:
summary: "High latency detected for {{ $labels.service }}"
description: |
Application {{ $labels.service }} 95th percentile latency is {{ $value }}s
for more than 2 minutes.
Current P95 latency: {{ $value }}s
Threshold: 0.5s
runbook_url: "https://runbooks.myapp.com/high-latency"
# ========================================
# Infrastructure Alerts
# ========================================
- alert: HighCPUUsage
expr: |
(
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
) > 80
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: |
CPU usage on {{ $labels.instance }} has been above 80% for more than 5 minutes.
Current usage: {{ $value | humanize }}%
Threshold: 80%
runbook_url: "https://runbooks.myapp.com/high-cpu"
- alert: HighMemoryUsage
expr: |
(
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
) > 0.85
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: |
Memory usage on {{ $labels.instance }} has been above 85% for more than 5 minutes.
Current usage: {{ $value | humanizePercentage }}
Available: {{ with printf "node_memory_MemAvailable_bytes{instance='%s'}" $labels.instance | query }}{{ . | first | value | humanize1024 }}B{{ end }}
Total: {{ with printf "node_memory_MemTotal_bytes{instance='%s'}" $labels.instance | query }}{{ . | first | value | humanize1024 }}B{{ end }}
runbook_url: "https://runbooks.myapp.com/high-memory"
- alert: DiskSpaceLow
expr: |
(
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
) > 85
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: |
Disk usage on {{ $labels.instance }}:{{ $labels.mountpoint }} is {{ $value | humanize }}%.
Available: {{ with printf "node_filesystem_avail_bytes{instance='%s',mountpoint='%s'}" $labels.instance $labels.mountpoint | query }}{{ . | first | value | humanize1024 }}B{{ end }}
Total: {{ with printf "node_filesystem_size_bytes{instance='%s',mountpoint='%s'}" $labels.instance $labels.mountpoint | query }}{{ . | first | value | humanize1024 }}B{{ end }}
runbook_url: "https://runbooks.myapp.com/disk-space-low"
# ========================================
# Database Alerts
# ========================================
- alert: DatabaseConnectionsHigh
expr: |
(pg_stat_activity_count / pg_settings_max_connections) > 0.8
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High database connections on {{ $labels.instance }}"
description: |
Database {{ $labels.instance }} is using {{ $value | humanizePercentage }} of max connections.
Current connections: {{ with printf "pg_stat_activity_count{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
Max connections: {{ with printf "pg_settings_max_connections{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
runbook_url: "https://runbooks.myapp.com/db-connections-high"
- alert: DatabaseSlowQueries
expr: |
pg_stat_statements_mean_time_ms > 1000
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Slow database queries detected on {{ $labels.instance }}"
description: |
Database {{ $labels.instance }} has queries with average execution time of {{ $value }}ms.
Query: {{ $labels.query }}
Average time: {{ $value }}ms
runbook_url: "https://runbooks.myapp.com/slow-queries"
# ========================================
# Kubernetes Alerts
# ========================================
- alert: KubernetesPodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[10m]) > 0
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: |
Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container }}
is restarting {{ $value }} times per second.
runbook_url: "https://runbooks.myapp.com/pod-crash-loop"
- alert: KubernetesNodeNotReady
expr: |
kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Kubernetes node {{ $labels.node }} is not ready"
description: |
Node {{ $labels.node }} has been in NotReady state for more than 5 minutes.
runbook_url: "https://runbooks.myapp.com/node-not-ready"
- alert: KubernetesPodPending
expr: |
kube_pod_status_phase{phase="Pending"} == 1
for: 10m
labels:
severity: warning
team: platform
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} stuck in Pending"
description: |
Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in Pending state for more than 10 minutes.
This usually indicates resource constraints or scheduling issues.
runbook_url: "https://runbooks.myapp.com/pod-pending"
- name: sli.rules
interval: 30s
rules:
# ========================================
# SLI (Service Level Indicator) Recording Rules
# ========================================
- record: sli:http_requests:rate5m
expr: |
sum(rate(http_requests_total[5m])) by (service, method, status)
- record: sli:http_request_duration:p95_5m
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
)
- record: sli:http_request_duration:p99_5m
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
)
- record: sli:availability:5m
expr: |
sum(rate(http_requests_total{status!~"5.."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service)
- record: sli:error_rate:5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service)
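Recording rules only pay off when something consumes them. As a hedged illustration, the sketch below queries sli:availability:5m through the Prometheus HTTP API (/api/v1/query) and fails a CI gate when a 99.9% availability objective is being violated; the Prometheus URL, the SLO target, and the script name are assumptions, not part of the configuration above.
// check-slo.js - hypothetical error-budget check against the recording rules above
// Requires Node 18+ (global fetch). Usage: node check-slo.js myapp
const PROMETHEUS_URL = process.env.PROMETHEUS_URL || "http://prometheus:9090";
const SLO_TARGET = 0.999; // assumed 99.9% availability objective
async function queryPrometheus(promql) {
const url = `${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(promql)}`;
const response = await fetch(url);
const body = await response.json();
if (body.status !== "success") throw new Error(`Prometheus query failed: ${body.error}`);
return body.data.result;
}
async function checkErrorBudget(service) {
// sli:availability:5m is defined by the recording rules above
const result = await queryPrometheus(`sli:availability:5m{service="${service}"}`);
if (result.length === 0) {
console.warn(`No availability data for service=${service}`);
return;
}
const availability = parseFloat(result[0].value[1]);
const budget = 1 - SLO_TARGET; // total error budget
const burned = (1 - availability) / budget; // fraction of budget currently burning
console.log(`availability=${(availability * 100).toFixed(3)}% budget burn=${(burned * 100).toFixed(1)}%`);
if (availability < SLO_TARGET) process.exitCode = 1; // fail the CI gate / surface in dashboards
}
checkErrorBudget(process.argv[2] || "myapp").catch((err) => {
console.error(err.message);
process.exitCode = 2;
});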
Alertmanager Configuration for Intelligent Routing:
# monitoring/alertmanager/alertmanager.yml - Professional alert routing
global:
smtp_smarthost: 'smtp.gmail.com:587'
smtp_from: 'alerts@myapp.com'
smtp_auth_username: 'alerts@myapp.com'
smtp_auth_password_file: '/etc/alertmanager/smtp_password'
# Templates for alert formatting
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Alert routing tree
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'default'
# Routing based on severity and team
routes:
# Critical alerts - immediate notification
- match:
severity: critical
group_wait: 10s
group_interval: 1m
repeat_interval: 5m
receiver: 'critical-alerts'
routes:
# Database critical issues
- match_re:
alertname: 'Database.*'
receiver: 'database-critical'
# Application down
- match:
alertname: 'ApplicationDown'
receiver: 'application-critical'
# Warning alerts - less urgent
- match:
severity: warning
group_wait: 2m
group_interval: 10m
repeat_interval: 4h
receiver: 'warning-alerts'
routes:
# Performance issues
- match_re:
alertname: '.*HighLatency|.*HighErrorRate'
receiver: 'performance-team'
# Infrastructure issues
- match_re:
alertname: 'High.*Usage|.*DiskSpace.*'
receiver: 'infrastructure-team'
# Business hours only alerts
- match:
severity: info
group_wait: 5m
group_interval: 30m
repeat_interval: 24h
receiver: 'info-alerts'
active_time_intervals:
- 'business-hours'
# Receivers define how alerts are sent
receivers:
- name: 'default'
slack_configs:
- api_url_file: '/etc/alertmanager/slack_webhook'
channel: '#alerts'
title: 'Default Alert'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
{{ end }}
- name: 'critical-alerts'
# Multiple notification channels for critical alerts
slack_configs:
- api_url_file: '/etc/alertmanager/slack_webhook'
channel: '#critical-alerts'
title: '🚨 CRITICAL ALERT 🚨'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Service:* {{ .Labels.service }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
*Dashboard:* {{ .Annotations.dashboard_url }}
{{ end }}
actions:
- type: button
text: 'View Runbook'
url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
- type: button
text: 'View Dashboard'
url: '{{ (index .Alerts 0).Annotations.dashboard_url }}'
email_configs:
- to: 'oncall@myapp.com'
subject: '🚨 CRITICAL: {{ (index .Alerts 0).Annotations.summary }}'
html: |
<h2>Critical Alert Triggered</h2>
{{ range .Alerts }}
<h3>{{ .Annotations.summary }}</h3>
<p><strong>Service:</strong> {{ .Labels.service }}</p>
<p><strong>Description:</strong> {{ .Annotations.description }}</p>
<p><strong>Runbook:</strong> <a href="{{ .Annotations.runbook_url }}">{{ .Annotations.runbook_url }}</a></p>
<p><strong>Dashboard:</strong> <a href="{{ .Annotations.dashboard_url }}">{{ .Annotations.dashboard_url }}</a></p>
{{ end }}
# PagerDuty integration for critical alerts
pagerduty_configs:
- routing_key_file: '/etc/alertmanager/pagerduty_key'
description: '{{ (index .Alerts 0).Annotations.summary }}'
details:
service: '{{ (index .Alerts 0).Labels.service }}'
severity: '{{ (index .Alerts 0).Labels.severity }}'
runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'
- name: 'database-critical'
slack_configs:
- api_url_file: '/etc/alertmanager/slack_webhook'
channel: '#database-alerts'
title: '🗄️ DATABASE CRITICAL ALERT'
text: |
{{ range .Alerts }}
*Database Alert:* {{ .Annotations.summary }}
*Instance:* {{ .Labels.instance }}
*Description:* {{ .Annotations.description }}
{{ end }}
email_configs:
- to: 'dba-team@myapp.com,oncall@myapp.com'
subject: 'DATABASE CRITICAL: {{ (index .Alerts 0).Annotations.summary }}'
- name: 'performance-team'
slack_configs:
- api_url_file: '/etc/alertmanager/slack_webhook'
channel: '#performance'
title: '⚡ Performance Alert'
text: |
{{ range .Alerts }}
*Performance Issue:* {{ .Annotations.summary }}
*Service:* {{ .Labels.service }}
*Description:* {{ .Annotations.description }}
{{ end }}
- name: 'infrastructure-team'
slack_configs:
- api_url_file: '/etc/alertmanager/slack_webhook'
channel: '#infrastructure'
title: '🏗️ Infrastructure Alert'
text: |
{{ range .Alerts }}
*Infrastructure Issue:* {{ .Annotations.summary }}
*Node:* {{ .Labels.instance }}
*Details:* {{ .Annotations.description }}
{{ end }}
# Inhibition rules - suppress certain alerts when others are firing
inhibit_rules:
# Suppress all other alerts when ApplicationDown is firing
- source_match:
alertname: 'ApplicationDown'
target_match_re:
alertname: '.*HighLatency|.*HighErrorRate'
equal: ['service']
# Suppress node alerts when entire node is down
- source_match:
alertname: 'KubernetesNodeNotReady'
target_match_re:
alertname: 'High.*Usage'
equal: ['instance']
# Time intervals for business hours alerting
time_intervals:
- name: 'business-hours'
time_intervals:
- times:
- start_time: '09:00'
end_time: '18:00'
weekdays: ['monday:friday']
location: 'America/New_York'
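A routing tree this elaborate deserves a rehearsal before a real incident. amtool check-config validates the YAML itself; the hypothetical script below goes one step further and pushes a synthetic ApplicationDown alert into Alertmanager's v2 API so you can watch it land in #critical-alerts and PagerDuty. The Alertmanager URL and the five-minute auto-resolve window are assumptions.
// send-test-alert.js - hypothetical synthetic alert to exercise the routing tree above
// Requires Node 18+ (global fetch); ALERTMANAGER_URL is an assumption.
const ALERTMANAGER_URL = process.env.ALERTMANAGER_URL || "http://alertmanager:9093";
async function sendTestAlert() {
const now = new Date();
const alerts = [
{
labels: {
alertname: "ApplicationDown", // matches the 'application-critical' route
severity: "critical",
service: "myapp",
team: "platform",
},
annotations: {
summary: "Synthetic test alert - please ignore",
description: "Fired by send-test-alert.js to verify routing and receivers.",
runbook_url: "https://runbooks.myapp.com/application-down",
},
startsAt: now.toISOString(),
endsAt: new Date(now.getTime() + 5 * 60 * 1000).toISOString(), // auto-resolves in 5 minutes
},
];
const response = await fetch(`${ALERTMANAGER_URL}/api/v2/alerts`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(alerts),
});
if (!response.ok) throw new Error(`Alertmanager returned ${response.status}`);
console.log("Test alert accepted - check #critical-alerts and PagerDuty");
}
sendTestAlert().catch((err) => {
console.error(err.message);
process.exitCode = 1;
});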
Log Aggregation and Analysis: Making Sense of System Behavior
From Log Chaos to Operational Intelligence
Understanding the log management evolution:
// Log management maturity: From chaos to intelligence
const logManagementEvolution = {
chaosStage: {
approach: "Logs scattered across servers, manual grep when things break",
characteristics: [
"SSH into servers to read log files",
"No standardized logging format",
"Logs rotated and lost regularly",
"Debugging requires accessing multiple servers",
],
problems: [
"Root cause analysis takes hours or days",
"No correlation between different services",
"Historical data lost due to rotation",
"Debugging distributed systems is impossible",
],
reality:
"Logs are write-only data - you collect them but can't use them effectively",
},
centralizedStage: {
approach: "All logs flow to central system, searchable and retained",
characteristics: [
"Centralized log collection (ELK, Fluentd)",
"Structured logging with consistent formats",
"Search and filter capabilities",
"Log retention and archival policies",
],
benefits: [
"Single place to search all logs",
"Better troubleshooting capabilities",
"Historical analysis possible",
"Correlation across services",
],
limitations: "Still reactive - logs are used after problems occur",
},
intelligentStage: {
approach: "Logs become operational intelligence, proactive insights",
characteristics: [
"Real-time log analysis and alerting",
"Machine learning for anomaly detection",
"Automatic correlation and pattern recognition",
"Integration with metrics and traces (observability)",
],
advantages: [
"Proactive problem detection from log patterns",
"Automatic root cause analysis",
"Predictive insights from log trends",
"Complete system observability",
],
outcome: "Logs become a proactive operational tool, not just debugging aid",
},
};
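Everything past the chaos stage presumes the application emits structured JSON in the first place. A minimal sketch of such a logger using winston is shown below; the field names (service, traceId, and the nested http object) are assumptions chosen to match what the Logstash filter further down expects, not a requirement of winston itself.
// logger.js - hypothetical structured logger whose fields line up with the Logstash filter below
// Assumes `npm install winston`; service/traceId/http.* are conventions, not winston built-ins.
const winston = require("winston");
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || "info",
format: winston.format.combine(
winston.format.timestamp(), // ISO8601 timestamp parsed by the date filter below
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: { service: "myapp", environment: process.env.NODE_ENV || "production" },
transports: [new winston.transports.Console()], // stdout -> Filebeat/Fluent Bit -> Logstash
});
// Request logging middleware emitting the nested http object the pipeline parses
function requestLogger(req, res, next) {
const start = Date.now();
res.on("finish", () => {
logger.info("request completed", {
traceId: req.headers["x-trace-id"],
http: {
method: req.method,
path: req.originalUrl,
status: res.statusCode,
responseTime: Date.now() - start,
userAgent: req.headers["user-agent"],
clientIp: req.ip,
},
});
});
next();
}
// Error logging with a stack trace the pipeline tags as "error"
function logError(err, context = {}) {
logger.error(err.message, { stack: err.stack, ...context });
}
module.exports = { logger, requestLogger, logError };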
ELK Stack Implementation for Production:
# logging/elasticsearch/elasticsearch.yml - Production Elasticsearch cluster
cluster.name: "myapp-logs-production"
node.name: "${HOSTNAME}"
node.roles: [data, master, ingest]
# Network settings
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
# Discovery for cluster formation
discovery.seed_hosts:
- "elasticsearch-0.elasticsearch-headless.logging.svc.cluster.local"
- "elasticsearch-1.elasticsearch-headless.logging.svc.cluster.local"
- "elasticsearch-2.elasticsearch-headless.logging.svc.cluster.local"
cluster.initial_master_nodes:
- "elasticsearch-0"
- "elasticsearch-1"
- "elasticsearch-2"
# Performance settings
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb
# Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
# Monitoring
xpack.monitoring.enabled: true
xpack.monitoring.collection.enabled: true
# Index lifecycle management
xpack.ilm.enabled: true
# Machine learning (for anomaly detection)
xpack.ml.enabled: true
xpack.ml.max_machine_memory_percent: 30
# logging/logstash/logstash.yml - Logstash pipeline configuration
node.name: "logstash-${HOSTNAME}"
path.data: /usr/share/logstash/data
path.config: /usr/share/logstash/pipeline
path.logs: /usr/share/logstash/logs
# Pipeline settings
pipeline.workers: 4
pipeline.batch.size: 2000
pipeline.batch.delay: 50
# Queue settings for reliability
queue.type: persisted
queue.max_bytes: 8gb
queue.checkpoint.writes: 1024
# Monitoring
monitoring.enabled: true
monitoring.elasticsearch.hosts:
- "https://elasticsearch:9200"
monitoring.elasticsearch.username: "logstash_system"
monitoring.elasticsearch.password: "${LOGSTASH_SYSTEM_PASSWORD}"
# Dead letter queue
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 2gb
# ========================================
# Pipeline Configuration
# ========================================
# logging/logstash/pipeline/main.conf - Comprehensive log processing pipeline
input {
# ========================================
# Application Logs via Filebeat
# ========================================
beats {
port => 5044
ssl => true
ssl_certificate_authorities => ["/usr/share/logstash/config/certs/ca.crt"]
ssl_certificate => "/usr/share/logstash/config/certs/logstash.crt"
ssl_key => "/usr/share/logstash/config/certs/logstash.key"
ssl_verify_mode => "force_peer"
}
# ========================================
# Kubernetes Logs via Fluent Bit
# ========================================
http {
port => 8080
codec => "json"
additional_codecs => {
"application/json" => "json"
}
}
# ========================================
# Database Logs (PostgreSQL)
# ========================================
jdbc {
jdbc_driver_library => "/usr/share/logstash/lib/postgresql.jar"
jdbc_driver_class => "org.postgresql.Driver"
jdbc_connection_string => "jdbc:postgresql://postgres:5432/logs"
jdbc_user => "${POSTGRES_USER}"
jdbc_password => "${POSTGRES_PASSWORD}"
schedule => "*/5 * * * * *"
statement => "
SELECT log_time, user_name, database_name, process_id,
connection_from, session_id, session_line_num, command_tag,
session_start_time, virtual_transaction_id, transaction_id,
error_severity, sql_state_code, message, detail, hint,
internal_query, internal_query_pos, context, query, query_pos,
location, application_name
FROM postgres_log
WHERE log_time > :sql_last_value
ORDER BY log_time ASC"
use_column_value => true
tracking_column => "log_time"
tracking_column_type => "timestamp"
}
# ========================================
# AWS CloudWatch Logs
# ========================================
cloudwatch_logs {
log_group => [
"/aws/lambda/myapp-*",
"/aws/apigateway/myapp",
"/aws/rds/instance/myapp-production/error"
]
region => "us-west-2"
aws_credentials_file => "/usr/share/logstash/config/aws_credentials"
interval => 60
start_position => "end"
}
}
filter {
# ========================================
# Parse and Enrich Application Logs
# ========================================
if [fields][log_type] == "application" {
# Parse JSON application logs
json {
source => "message"
target => "app"
}
# Extract timestamp
date {
match => [ "[app][timestamp]", "ISO8601" ]
target => "@timestamp"
}
# Parse log level
mutate {
add_field => { "log_level" => "%{[app][level]}" }
add_field => { "service_name" => "%{[app][service]}" }
add_field => { "trace_id" => "%{[app][traceId]}" }
add_field => { "span_id" => "%{[app][spanId]}" }
}
# Detect error patterns
if [app][level] == "error" or [app][level] == "fatal" {
mutate {
add_tag => [ "error", "alert" ]
}
# Extract stack trace
if [app][stack] {
mutate {
add_field => { "error_stack" => "%{[app][stack]}" }
}
}
}
# Parse HTTP request logs
if [app][http] {
mutate {
add_field => { "http_method" => "%{[app][http][method]}" }
add_field => { "http_status" => "%{[app][http][status]}" }
add_field => { "http_path" => "%{[app][http][path]}" }
add_field => { "response_time" => "%{[app][http][responseTime]}" }
add_field => { "user_agent" => "%{[app][http][userAgent]}" }
add_field => { "client_ip" => "%{[app][http][clientIp]}" }
}
# Convert response time to number
mutate {
convert => { "response_time" => "float" }
convert => { "http_status" => "integer" }
}
# Tag slow requests
if [response_time] and [response_time] > 2000 {
mutate {
add_tag => [ "slow_request" ]
}
}
# Tag error responses
if [http_status] >= 400 {
mutate {
add_tag => [ "http_error" ]
}
}
}
}
# ========================================
# Parse Nginx Access Logs
# ========================================
if [fields][log_type] == "nginx" {
grok {
match => {
"message" => "%{NGINXACCESS}"
}
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}
mutate {
convert => {
"response" => "integer"
"bytes" => "integer"
"responsetime" => "float"
}
}
# GeoIP enrichment
geoip {
source => "clientip"
target => "geoip"
}
# User agent parsing
useragent {
source => "agent"
target => "user_agent"
}
}
# ========================================
# Parse Database Logs
# ========================================
if [fields][log_type] == "database" {
# Parse PostgreSQL logs
grok {
match => {
"message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:process_id}\] %{WORD:log_level}: %{GREEDYDATA:log_message}"
}
}
# Extract slow query information
if [log_message] =~ /duration: (\d+\.\d+) ms/ {
grok {
match => {
"log_message" => "duration: %{NUMBER:query_duration:float} ms.*statement: %{GREEDYDATA:sql_query}"
}
}
if [query_duration] and [query_duration] > 1000 {
mutate {
add_tag => [ "slow_query" ]
}
}
}
# Extract connection information
if [log_message] =~ /connection/ {
mutate {
add_tag => [ "connection_event" ]
}
}
}
# ========================================
# Parse Kubernetes Logs
# ========================================
if [kubernetes] {
# Add Kubernetes metadata
mutate {
add_field => { "k8s_namespace" => "%{[kubernetes][namespace_name]}" }
add_field => { "k8s_pod" => "%{[kubernetes][pod_name]}" }
add_field => { "k8s_container" => "%{[kubernetes][container_name]}" }
add_field => { "k8s_node" => "%{[kubernetes][host]}" }
}
# Parse container logs
if [kubernetes][container_name] == "myapp" {
json {
source => "message"
target => "app"
}
}
}
# ========================================
# Security Event Detection
# ========================================
# Detect authentication failures
if [message] =~ /authentication failed|login failed|invalid credentials/i {
mutate {
add_tag => [ "security", "auth_failure" ]
add_field => { "security_event" => "authentication_failure" }
}
}
# Detect SQL injection attempts
if [message] =~ /union.*select|or.*1.*=.*1|drop.*table/i {
mutate {
add_tag => [ "security", "sql_injection" ]
add_field => { "security_event" => "sql_injection_attempt" }
}
}
# Detect unusual access patterns
if [client_ip] and [http_path] and [user_agent] {
# This would integrate with threat intelligence feeds
# For now, detect obvious bot patterns
if [user_agent] =~ /bot|crawler|spider|scraper/i {
mutate {
add_tag => [ "bot_traffic" ]
}
}
}
# ========================================
# Performance Metrics Extraction
# ========================================
# Extract database performance metrics
if "database" in [tags] and [query_duration] {
metrics {
meter => [ "database.queries" ]
timer => { "database.query_duration" => "%{query_duration}" }
increment => [ "database.slow_queries" ]
flush_interval => 30
}
}
# Extract HTTP performance metrics
if [response_time] and [http_status] {
metrics {
meter => [ "http.requests" ]
timer => { "http.response_time" => "%{response_time}" }
increment => [ "http.status.%{http_status}" ]
}
}
# ========================================
# Data Cleanup and Standardization
# ========================================
# Remove sensitive information
mutate {
gsub => [
"message", "password=[^&\s]*", "password=***",
"message", "token=[^&\s]*", "token=***",
"message", "api_key=[^&\s]*", "api_key=***"
]
}
# Add environment information
mutate {
add_field => {
"environment" => "production"
"log_processed_at" => "%{@timestamp}"
"logstash_node" => "${HOSTNAME}"
}
}
# Remove unnecessary fields to reduce storage
mutate {
remove_field => [ "[app][pid]", "[app][hostname]", "beat", "prospector" ]
}
}
output {
# ========================================
# Elasticsearch for Search and Analytics
# ========================================
elasticsearch {
hosts => ["elasticsearch-0:9200", "elasticsearch-1:9200", "elasticsearch-2:9200"]
ssl => true
ssl_certificate_verification => true
cacert => "/usr/share/logstash/config/certs/ca.crt"
user => "logstash_writer"
password => "${LOGSTASH_WRITER_PASSWORD}"
# Use index templates for better management
index => "logs-%{service_name:unknown}-%{+YYYY.MM.dd}"
template_name => "logs"
template => "/usr/share/logstash/templates/logs.json"
template_overwrite => true
# Document routing for better performance
routing => "%{service_name}"
# Retry configuration
retry_on_conflict => 3
retry_initial_interval => 2
retry_max_interval => 64
# Bulk sizing is governed by pipeline.batch.size in logstash.yml above
}
# ========================================
# Real-time Alerting for Critical Events
# ========================================
if "error" in [tags] or "security" in [tags] {
http {
url => "https://alertmanager.myapp.com/api/v1/alerts"
http_method => "post"
headers => {
"Content-Type" => "application/json"
"Authorization" => "Bearer ${ALERTMANAGER_TOKEN}"
}
content_type => "application/json"
format => "json"
mapping => {
"alerts" => [
{
"labels" => {
"alertname" => "LogAlert"
"severity" => "warning"
"service" => "%{service_name}"
"environment" => "production"
}
"annotations" => {
"summary" => "Critical log event detected"
"description" => "%{message}"
"log_level" => "%{log_level}"
"timestamp" => "%{@timestamp}"
}
"generatorURL" => "https://kibana.myapp.com"
}
]
}
}
}
# ========================================
# Long-term Storage for Compliance
# ========================================
if [log_level] in ["error", "fatal"] or "security" in [tags] {
s3 {
access_key_id => "${AWS_ACCESS_KEY_ID}"
secret_access_key => "${AWS_SECRET_ACCESS_KEY}"
region => "us-west-2"
bucket => "myapp-logs-archive"
prefix => "year=%{+YYYY}/month=%{+MM}/day=%{+dd}/hour=%{+HH}"
codec => "json_lines"
time_file => 60
size_file => 104857600 # 100MB
# Server-side encryption
server_side_encryption_algorithm => "AES256"
# Lifecycle management via bucket policy
storage_class => "STANDARD_IA"
}
}
# ========================================
# Metrics Export to Prometheus
# ========================================
if [response_time] {
statsd {
host => "prometheus-statsd-exporter"
port => 8125
gauge => { "http_response_time" => "%{response_time}" }
increment => [ "http_requests_total" ]
sample_rate => 0.1
}
}
# ========================================
# Debug Output (Development Only)
# ========================================
# stdout {
# codec => rubydebug { metadata => true }
# }
}
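Once events land in the logs-* indices written above, they are queryable programmatically as well as through Kibana. The hedged sketch below pulls the most recent error- and security-tagged documents for one service via the Elasticsearch _search API; the endpoint, basic-auth credentials, index pattern, and field names mirror the pipeline above but are assumptions about your deployment.
// recent-errors.js - hypothetical query for recent error logs written by the pipeline above
// Requires Node 18+ (global fetch); ES URL, credentials, and mappings are assumptions.
const ES_URL = process.env.ES_URL || "https://elasticsearch:9200";
const ES_AUTH = Buffer.from(`elastic:${process.env.ES_PASSWORD || ""}`).toString("base64");
async function recentErrors(service, minutes = 15) {
const query = {
size: 20,
sort: [{ "@timestamp": "desc" }],
query: {
bool: {
filter: [
{ term: { "service_name.keyword": service } }, // assumes default keyword sub-field
{ terms: { tags: ["error", "security"] } },
{ range: { "@timestamp": { gte: `now-${minutes}m` } } },
],
},
},
};
const response = await fetch(`${ES_URL}/logs-${service}-*/_search`, {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Basic ${ES_AUTH}`,
},
body: JSON.stringify(query),
});
if (!response.ok) throw new Error(`Elasticsearch returned ${response.status}`);
const body = await response.json();
for (const hit of body.hits.hits) {
const doc = hit._source;
console.log(`${doc["@timestamp"]} [${doc.log_level}] ${doc.message}`);
}
console.log(`total matching: ${body.hits.total.value}`);
}
recentErrors(process.argv[2] || "myapp").catch((err) => {
console.error(err.message);
process.exitCode = 1;
});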
Disaster Recovery and Backups: Business Continuity Excellence
From Hope-Based Recovery to Tested Resilience
Understanding disaster recovery maturity:
// Disaster recovery evolution: From hope to certainty
const disasterRecoveryMaturity = {
hopeBasedRecovery: {
approach: "Assume nothing bad will happen, deal with disasters reactively",
characteristics: [
"No formal backup strategy",
"Occasional manual backups",
"Recovery procedures untested",
"Single points of failure everywhere",
],
problems: [
"Data loss when disasters occur",
"Extended downtime during recovery",
"Recovery procedures fail when needed",
"Business operations halt completely",
],
reality: "Hope is not a strategy - disasters will happen",
},
basicBackupStrategy: {
approach: "Regular backups with basic recovery procedures",
characteristics: [
"Automated backup schedules",
"Multiple backup retention periods",
"Basic recovery documentation",
"Some redundancy in critical systems",
],
benefits: [
"Data protection against common failures",
"Faster recovery than no backup strategy",
"Some confidence in business continuity",
],
limitations:
"Recovery time still significant, procedures may fail under pressure",
},
comprehensiveDisasterRecovery: {
approach:
"Tested, automated disaster recovery with business continuity planning",
characteristics: [
"Multi-tier backup and recovery strategy",
"Automated failover and recovery systems",
"Regular disaster recovery testing",
"Cross-region redundancy and replication",
],
advantages: [
"Minimal data loss (RPO < 5 minutes)",
"Minimal downtime (RTO < 30 minutes)",
"Confidence through regular testing",
"Business operations continue seamlessly",
],
outcome: "Disasters become minor inconveniences, not existential threats",
},
};
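A quick way to keep the hope-based stage honest is to measure backup freshness continuously rather than during an incident. The hypothetical check below lists the PostgreSQL backups in S3 (using the same bucket naming convention as the script that follows) and fails when the newest object is older than the RPO target; the bucket, prefix, 24-hour threshold, and AWS SDK v3 usage are assumptions.
// backup-freshness.js - hypothetical RPO check on the backup bucket used by the script below
// Assumes `npm install @aws-sdk/client-s3`; bucket, prefix, and threshold are assumptions.
const { S3Client, ListObjectsV2Command } = require("@aws-sdk/client-s3");
const BUCKET = process.env.BACKUP_BUCKET || "myapp-backups-production";
const PREFIX = "postgres/";
const MAX_AGE_HOURS = 24; // alert if the newest backup is older than this
async function checkBackupFreshness() {
const s3 = new S3Client({ region: process.env.AWS_REGION || "us-west-2" });
const response = await s3.send(
new ListObjectsV2Command({ Bucket: BUCKET, Prefix: PREFIX })
);
const objects = response.Contents || [];
if (objects.length === 0) {
throw new Error(`No backups found under s3://${BUCKET}/${PREFIX}`);
}
// Find the most recently modified backup object
const newest = objects.reduce((latest, obj) =>
obj.LastModified > latest.LastModified ? obj : latest
);
const ageHours = (Date.now() - newest.LastModified.getTime()) / 3600000;
console.log(`Newest backup: ${newest.Key} (${ageHours.toFixed(1)}h old)`);
if (ageHours > MAX_AGE_HOURS) {
console.error(`RPO violation: newest backup exceeds ${MAX_AGE_HOURS}h`);
process.exitCode = 1; // wire this into a cron job or the monitoring stack above
}
}
checkBackupFreshness().catch((err) => {
console.error(err.message);
process.exitCode = 2;
});
Run on a schedule, this turns "are the backups actually happening?" into a measurable signal instead of an assumption you verify only during a disaster.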
Comprehensive Backup Strategy Implementation:
#!/bin/bash
# backup-strategy.sh - Professional multi-tier backup system
set -euo pipefail
# Configuration
PROJECT_NAME="${PROJECT_NAME:-myapp}"
ENVIRONMENT="${ENVIRONMENT:-production}"
AWS_REGION="${AWS_REGION:-us-west-2}"
BACKUP_BUCKET="${PROJECT_NAME}-backups-${ENVIRONMENT}"
RETENTION_DAYS_DAILY="${RETENTION_DAYS_DAILY:-30}"
RETENTION_DAYS_WEEKLY="${RETENTION_DAYS_WEEKLY:-90}"
RETENTION_DAYS_MONTHLY="${RETENTION_DAYS_MONTHLY:-365}"
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}
error() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $1" >&2
exit 1
}
# ========================================
# Database Backup Strategy
# ========================================
backup_postgresql() {
local db_host="$1"
local db_name="$2"
local backup_type="${3:-full}" # full, incremental
log "Starting PostgreSQL backup: $backup_type for $db_name"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_dir="/tmp/backups/postgres"
local backup_file="$backup_dir/${db_name}_${backup_type}_${timestamp}"
mkdir -p "$backup_dir"
case "$backup_type" in
"full")
# Full database dump with compression
PGPASSWORD="$DB_PASSWORD" pg_dump \
--host="$db_host" \
--username="$DB_USER" \
--dbname="$db_name" \
--format=custom \
--compress=9 \
--verbose \
--file="${backup_file}.dump"
# Binary backup using pg_basebackup for faster recovery
PGPASSWORD="$DB_PASSWORD" pg_basebackup \
--host="$db_host" \
--username="$DB_USER" \
--format=tar \
--compress=9 \
--progress \
--verbose \
--wal-method=stream \
--pgdata="${backup_file}_basebackup"
# Tar the basebackup directory
tar -czf "${backup_file}_basebackup.tar.gz" -C "${backup_file}_basebackup" .
rm -rf "${backup_file}_basebackup"
;;
"incremental")
# WAL archive backup for point-in-time recovery
local wal_backup_dir="$backup_dir/wal_archive"
mkdir -p "$wal_backup_dir"
# Sync WAL files from archive location
aws s3 sync "s3://$BACKUP_BUCKET/wal_archive/" "$wal_backup_dir/" \
--region "$AWS_REGION"
# Create incremental backup metadata
cat > "${backup_file}_incremental.json" << EOF
{
"backup_type": "incremental",
"timestamp": "$timestamp",
"base_backup_lsn": "$(PGPASSWORD="$DB_PASSWORD" psql -h "$db_host" -U "$DB_USER" -d "$db_name" -t -c "SELECT pg_current_wal_lsn();" | tr -d ' ')",
"wal_files_count": $(find "$wal_backup_dir" -name "*.wal" | wc -l),
"total_size": "$(du -sh "$wal_backup_dir" | cut -f1)"
}
EOF
;;
esac
# Calculate checksums for integrity verification
find "$backup_dir" -name "*${timestamp}*" -type f -exec sha256sum {} \; > "${backup_file}_checksums.txt"
# Upload to S3 with server-side encryption
aws s3 sync "$backup_dir/" "s3://$BACKUP_BUCKET/postgres/" \
--region "$AWS_REGION" \
--storage-class STANDARD_IA \
--server-side-encryption AES256 \
--metadata backup_type="$backup_type",timestamp="$timestamp",environment="$ENVIRONMENT"
# Verify upload integrity (full backups produce a .dump file)
if [ "$backup_type" = "full" ]; then
aws s3api head-object \
--bucket "$BACKUP_BUCKET" \
--key "postgres/${db_name}_${backup_type}_${timestamp}.dump" \
--region "$AWS_REGION" > /dev/null || error "Backup upload verification failed"
fi
# Clean up local files
rm -rf "$backup_dir"/*${timestamp}*
log "PostgreSQL backup completed successfully"
}
# ========================================
# Kubernetes Resources Backup
# ========================================
backup_kubernetes_resources() {
local namespace="$1"
log "Starting Kubernetes resources backup for namespace: $namespace"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_dir="/tmp/backups/kubernetes"
local backup_file="$backup_dir/k8s_${namespace}_${timestamp}"
mkdir -p "$backup_dir"
# Backup all resources in namespace
kubectl get all,configmaps,secrets,persistentvolumeclaims,ingresses \
--namespace="$namespace" \
--output=yaml > "${backup_file}_resources.yaml"
# Backup persistent volumes data using Velero if available
if command -v velero &> /dev/null; then
log "Creating Velero backup for namespace: $namespace"
velero backup create "backup-${namespace}-${timestamp}" \
--include-namespaces="$namespace" \
--storage-location=default \
--volume-snapshot-locations=default \
--ttl=720h
fi
# Backup cluster-level resources
kubectl get nodes,persistentvolumes,storageclasses,clusterroles,clusterrolebindings \
--output=yaml > "${backup_file}_cluster_resources.yaml"
# Backup Helm releases
if command -v helm &> /dev/null; then
helm list --namespace="$namespace" --output=json > "${backup_file}_helm_releases.json"
# Export each Helm release values
helm list --namespace="$namespace" --short | while read -r release; do
if [ -n "$release" ]; then
helm get values "$release" --namespace="$namespace" > "${backup_file}_helm_${release}_values.yaml"
fi
done
fi
# Create backup metadata
cat > "${backup_file}_metadata.json" << EOF
{
"backup_type": "kubernetes_resources",
"namespace": "$namespace",
"timestamp": "$timestamp",
"cluster_version": "$(kubectl version --short --client=false | grep Server | cut -d' ' -f3)",
"node_count": $(kubectl get nodes --no-headers | wc -l),
"pod_count": $(kubectl get pods --namespace="$namespace" --no-headers | wc -l)
}
EOF
# Compress backup files
tar -czf "${backup_file}.tar.gz" -C "$backup_dir" $(basename "${backup_file}")*
rm -f "${backup_file}"*
# Upload to S3
aws s3 cp "${backup_file}.tar.gz" "s3://$BACKUP_BUCKET/kubernetes/" \
--region "$AWS_REGION" \
--storage-class STANDARD_IA \
--server-side-encryption AES256
# Clean up local files
rm -f "${backup_file}.tar.gz"
log "Kubernetes backup completed successfully"
}
# ========================================
# Application Data Backup
# ========================================
backup_application_data() {
local data_path="$1"
local backup_name="$2"
log "Starting application data backup: $backup_name"
local timestamp=$(date +%Y%m%d_%H%M%S)
local backup_dir="/tmp/backups/application"
local backup_file="$backup_dir/${backup_name}_${timestamp}"
mkdir -p "$backup_dir"
# Create compressed archive with progress
tar -czf "${backup_file}.tar.gz" \
--directory="$(dirname "$data_path")" \
--verbose \
--exclude='*.tmp' \
--exclude='*.log' \
--exclude='cache/*' \
"$(basename "$data_path")"
# Generate integrity checksum
sha256sum "${backup_file}.tar.gz" > "${backup_file}.sha256"
# Create backup manifest
cat > "${backup_file}_manifest.json" << EOF
{
"backup_name": "$backup_name",
"source_path": "$data_path",
"timestamp": "$timestamp",
"size_bytes": $(stat -f%z "${backup_file}.tar.gz" 2>/dev/null || stat -c%s "${backup_file}.tar.gz"),
"file_count": $(tar -tzf "${backup_file}.tar.gz" | wc -l),
"checksum": "$(cut -d' ' -f1 < "${backup_file}.sha256")"
}
EOF
# Upload to S3 with lifecycle transition
aws s3 cp "${backup_file}.tar.gz" "s3://$BACKUP_BUCKET/application/" \
--region "$AWS_REGION" \
--storage-class STANDARD_IA \
--server-side-encryption AES256 \
--metadata backup_name="$backup_name",timestamp="$timestamp"
aws s3 cp "${backup_file}.sha256" "s3://$BACKUP_BUCKET/application/" \
--region "$AWS_REGION" \
--storage-class STANDARD_IA \
--server-side-encryption AES256
aws s3 cp "${backup_file}_manifest.json" "s3://$BACKUP_BUCKET/application/" \
--region "$AWS_REGION" \
--storage-class STANDARD_IA \
--server-side-encryption AES256
# Clean up local files
rm -f "${backup_file}"*
log "Application data backup completed successfully"
}
# ========================================
# Backup Verification and Testing
# ========================================
verify_backup_integrity() {
local backup_type="$1"
local backup_identifier="$2"
log "Verifying backup integrity: $backup_type/$backup_identifier"
case "$backup_type" in
"postgres")
# Download and verify database backup
local temp_dir="/tmp/verify_backup"
mkdir -p "$temp_dir"
aws s3 cp "s3://$BACKUP_BUCKET/postgres/${backup_identifier}.dump" \
"$temp_dir/" --region "$AWS_REGION"
# Verify dump file integrity
if ! pg_restore --list "$temp_dir/${backup_identifier}.dump" &>/dev/null; then
error "Database backup integrity check failed"
fi
;;
"kubernetes")
# Verify Kubernetes backup YAML validity
local temp_dir="/tmp/verify_backup"
mkdir -p "$temp_dir"
aws s3 cp "s3://$BACKUP_BUCKET/kubernetes/${backup_identifier}.tar.gz" \
"$temp_dir/" --region "$AWS_REGION"
tar -xzf "$temp_dir/${backup_identifier}.tar.gz" -C "$temp_dir"
# Validate YAML files with a client-side dry run
for manifest in "$temp_dir"/*.yaml; do
kubectl apply --dry-run=client -f "$manifest" > /dev/null || \
error "Kubernetes backup validation failed: $manifest"
done
;;
esac
log "Backup integrity verification passed"
rm -rf "$temp_dir"
}
# ========================================
# Backup Restoration Procedures
# ========================================
restore_database() {
local backup_identifier="$1"
local target_db_host="$2"
local target_db_name="$3"
log "Starting database restoration: $backup_identifier"
# Download backup from S3
local restore_dir="/tmp/restore"
mkdir -p "$restore_dir"
aws s3 cp "s3://$BACKUP_BUCKET/postgres/${backup_identifier}.dump" \
"$restore_dir/" --region "$AWS_REGION"
# Verify checksum if available
if aws s3 ls "s3://$BACKUP_BUCKET/postgres/${backup_identifier}_checksums.txt" --region "$AWS_REGION" &>/dev/null; then
aws s3 cp "s3://$BACKUP_BUCKET/postgres/${backup_identifier}_checksums.txt" \
"$restore_dir/" --region "$AWS_REGION"
cd "$restore_dir"
if ! sha256sum --check "${backup_identifier}_checksums.txt"; then
error "Backup file integrity check failed"
fi
cd - > /dev/null
fi
# Create restoration database if needed
PGPASSWORD="$DB_PASSWORD" psql \
--host="$target_db_host" \
--username="$DB_USER" \
--command="CREATE DATABASE ${target_db_name}_restore;" || true
# Restore database
PGPASSWORD="$DB_PASSWORD" pg_restore \
--host="$target_db_host" \
--username="$DB_USER" \
--dbname="${target_db_name}_restore" \
--verbose \
--clean \
--if-exists \
--no-owner \
--no-privileges \
"$restore_dir/${backup_identifier}.dump"
# Verify restoration
local restored_tables=$(PGPASSWORD="$DB_PASSWORD" psql \
--host="$target_db_host" \
--username="$DB_USER" \
--dbname="${target_db_name}_restore" \
--tuples-only \
--command="SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")
if [ "$restored_tables" -eq 0 ]; then
error "Database restoration failed - no tables found"
fi
log "Database restoration completed successfully: $restored_tables tables restored"
rm -rf "$restore_dir"
}
# ========================================
# Backup Lifecycle Management
# ========================================
cleanup_old_backups() {
log "Starting backup lifecycle management"
local current_date=$(date +%s)
local daily_cutoff=$((current_date - (RETENTION_DAYS_DAILY * 86400)))
local weekly_cutoff=$((current_date - (RETENTION_DAYS_WEEKLY * 86400)))
local monthly_cutoff=$((current_date - (RETENTION_DAYS_MONTHLY * 86400)))
# Get all backup objects
aws s3api list-objects-v2 \
--bucket "$BACKUP_BUCKET" \
--region "$AWS_REGION" \
--query 'Contents[?StorageClass!=`GLACIER`].[Key,LastModified]' \
--output text | while read -r key last_modified; do
local modified_timestamp=$(date -d "$last_modified" +%s)
local age_days=$(( (current_date - modified_timestamp) / 86400 ))
# Apply retention policies
if [ $modified_timestamp -lt $monthly_cutoff ]; then
log "Deleting old backup: $key (age: $age_days days)"
aws s3 rm "s3://$BACKUP_BUCKET/$key" --region "$AWS_REGION"
elif [ $modified_timestamp -lt $weekly_cutoff ] && [[ ! "$key" =~ weekly ]]; then
# Transition to cheaper storage
aws s3api copy-object \
--bucket "$BACKUP_BUCKET" \
--copy-source "$BACKUP_BUCKET/$key" \
--key "$key" \
--storage-class GLACIER \
--metadata-directive COPY \
--region "$AWS_REGION"
fi
done
log "Backup lifecycle management completed"
}
# ========================================
# Disaster Recovery Testing
# ========================================
disaster_recovery_test() {
local test_type="$1" # full, database, application
log "Starting disaster recovery test: $test_type"
case "$test_type" in
"full")
# Test complete environment restoration
test_database_recovery
test_kubernetes_recovery
test_application_data_recovery
;;
"database")
test_database_recovery
;;
"application")
test_application_data_recovery
;;
esac
log "Disaster recovery test completed successfully"
}
test_database_recovery() {
log "Testing database recovery procedures"
# Find latest backup
local latest_backup=$(aws s3 ls "s3://$BACKUP_BUCKET/postgres/" --region "$AWS_REGION" | \
grep "full" | sort | tail -n1 | awk '{print $4}' | sed 's/.dump$//')
if [ -z "$latest_backup" ]; then
error "No database backups found for testing"
fi
# Test restoration to temporary database
restore_database "$latest_backup" "$DB_HOST" "test_restore_$(date +%s)"
log "Database recovery test passed"
}
# Main command dispatcher
case "${1:-help}" in
"postgres")
backup_postgresql "${2:-$DB_HOST}" "${3:-$DB_NAME}" "${4:-full}"
;;
"kubernetes")
backup_kubernetes_resources "${2:-production}"
;;
"application")
backup_application_data "${2:-/app/data}" "${3:-appdata}"
;;
"verify")
verify_backup_integrity "$2" "$3"
;;
"restore-db")
restore_database "$2" "${3:-$DB_HOST}" "${4:-$DB_NAME}"
;;
"cleanup")
cleanup_old_backups
;;
"test-dr")
disaster_recovery_test "${2:-database}"
;;
"full-backup")
backup_postgresql "$DB_HOST" "$DB_NAME" "full"
backup_kubernetes_resources "production"
backup_application_data "/app/data" "appdata"
;;
"help"|*)
cat << EOF
Professional Backup and Disaster Recovery System
Usage: $0 <command> [options]
Commands:
postgres <host> <dbname> [type] Backup PostgreSQL database
kubernetes <namespace> Backup Kubernetes resources
application <path> <name> Backup application data
verify <type> <identifier> Verify backup integrity
restore-db <backup> [host] [db] Restore database from backup
cleanup Clean up old backups per retention policy
test-dr [type] Run disaster recovery test
full-backup Run complete backup suite
Examples:
$0 postgres mydb-host myapp full
$0 kubernetes production
$0 application /app/data userdata
$0 verify postgres myapp_full_20231201_120000
$0 test-dr full
EOF
;;
esac
Conclusion: From Infrastructure Amateur to Operations Excellence
You’ve now mastered the complete deployment and infrastructure ecosystem that separates professional operations from amateur setups that crumble under real-world pressure.
What you’ve accomplished:
- CI/CD Pipeline Mastery: Automated deployment pipelines with comprehensive testing, security scanning, and intelligent deployment strategies that eliminate manual errors and enable confident, frequent releases
- Advanced Infrastructure as Code: Modular, collaborative infrastructure management with state management, environment consistency, and collaborative workflows that make infrastructure changes predictable and reviewable
- Proactive Monitoring Excellence: Comprehensive observability with intelligent alerting, anomaly detection, and operational insights that prevent problems instead of reacting to them
- Intelligent Log Management: Centralized log aggregation with real-time analysis, security event detection, and operational intelligence that transforms logs from debugging aids into proactive operational tools
- Battle-tested Disaster Recovery: Multi-tier backup strategies with automated recovery procedures, regular testing, and business continuity planning that makes disasters minor inconveniences instead of existential threats
The professional operations transformation you’ve achieved:
// Your operations evolution: From amateur to excellence
const operationalTransformation = {
before: {
deployments: "Manual SSH, pray nothing breaks, debug in production",
monitoring: "Customers tell us when things break via angry emails",
infrastructure:
"Snowflake servers, configuration drift, single points of failure",
logging: "SSH and grep through scattered log files when debugging",
disasterRecovery: "Hope nothing bad happens, panic when it does",
teamProductivity: "80% time firefighting, 20% building features",
customerExperience:
"Unpredictable outages, slow performance, data loss risk",
},
after: {
deployments:
"Automated CI/CD with zero-downtime, automatic rollback, comprehensive testing",
monitoring:
"Proactive alerts before customer impact, predictive problem prevention",
infrastructure:
"Infrastructure as code, consistent environments, auto-scaling resilience",
logging:
"Centralized intelligence with real-time analysis and security detection",
disasterRecovery:
"Tested procedures with automated failover and minimal downtime",
teamProductivity:
"5% time on operations, 95% time on innovation and features",
customerExperience: "Reliable service, optimal performance, zero data loss",
},
businessImpact: [
"Deploy 10x more frequently with higher reliability",
"Mean time to resolution reduced from hours to minutes",
"Infrastructure costs optimized through automation and monitoring",
"Development velocity increased through operational excellence",
"Customer satisfaction improved through reliable service delivery",
"Competitive advantage through faster innovation cycles",
],
};
You now operate infrastructure that scales confidently, deploys reliably, monitors proactively, and recovers gracefully. Your systems enable teams to focus on building value instead of fighting operational fires.
But here’s the production reality that creates the biggest impact: these operational practices aren’t just about technology—they’re about enabling business success. While your competitors struggle with manual deployments, reactive monitoring, and disaster recovery panic, your infrastructure becomes an invisible foundation that just works, allowing your team to outpace the market with rapid innovation and reliable service delivery.
Your operations are no longer a liability that might break—they’re a competitive advantage that enables greatness. The next time someone asks if you’re ready for production scale, you won’t just say yes—you’ll demonstrate it with the confidence that comes from professional operational excellence.
Welcome to the ranks of engineers who build systems that businesses can depend on. Your infrastructure is now ready for whatever scale and challenges come next.