Deployment & Infrastructure - 2/2

From Infrastructure Foundations to Production Excellence

You’ve mastered professional deployment strategies with rolling updates, blue-green deployments, and canary releases that handle traffic gracefully. You’ve established cloud-native infrastructure on AWS, GCP, and Azure with proper networking and security, implemented Infrastructure as Code with Terraform for consistent, version-controlled provisioning, and set up load balancers, CDNs, and server management that scales horizontally. Your infrastructure now operates as an enterprise-grade system that deploys reliably and scales automatically. But here’s the production reality that separates functional infrastructure from world-class operations: perfect infrastructure means nothing if your deployment process requires manual intervention, your monitoring can’t detect issues before customers notice, there is no disaster recovery plan for when things go catastrophically wrong, and there is no CI/CD automation letting teams deploy confidently dozens of times per day.

The production operations nightmare that destroys scalable businesses:

# Your operations horror story
# CEO: "We need to deploy the critical bug fix NOW, customers are churning"

# Attempt 1: Manual deployment at 2 AM
$ ssh production-server
production$ git pull origin main
# Merge conflict in critical configuration file
# No automated tests, deploying blind

$ sudo systemctl restart myapp
Job for myapp.service failed because the control process exited with error code.
# Service won't start, no clear error logs

$ sudo journalctl -u myapp
# 50,000 lines of generic logs, needle in haystack
# No structured logging, no error aggregation

# Attempt 2: Emergency rollback
$ git log --oneline
# 47 commits since last known good state
# No release tags, no deployment tracking
# Which commit was actually deployed last?

$ git checkout HEAD~5
$ sudo systemctl restart myapp
# Service starts but database migrations are incompatible
# Data corruption in production database

# Attempt 3: Infrastructure disaster
# Primary database server dies during peak traffic
$ aws rds describe-db-instances --db-instance-identifier prod-db
{
    "DBInstanceStatus": "failed"
}
# No automated failover, no backups tested in 6 months
# Customer data potentially lost forever

# Attempt 4: Monitoring blindness
# Load balancer shows 500 errors for 30 minutes
# First alert comes from angry customer on Twitter
# No proactive monitoring, no alerting
# "Why didn't anyone tell us the site was down?"

# The cascading operations disasters:
# - No CI/CD pipeline, deployments via "git pull and pray"
# - No automated testing, bugs discovered by customers
# - No monitoring, outages discovered via social media
# - No logging strategy, debugging takes hours
# - No disaster recovery, single points of failure everywhere
# - No backup strategy, data loss risk on every failure
# - No change tracking, impossible to identify what broke

# Result: 8-hour outage during Black Friday
# $2M in lost revenue, 30% customer churn
# Engineering team working 16-hour days for weeks
# Company reputation destroyed, acquisition talks canceled
# The brutal truth: Great infrastructure can't save amateur operations
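
Even before a full CI/CD pipeline exists, the "which commit is actually live?" question is cheap to eliminate. Here is a minimal sketch of deployment tagging — the script name, the record path, and the /DEPLOY_INFO.json format are illustrative assumptions, not a prescribed layout:

#!/usr/bin/env bash
# deploy-tag.sh - record exactly what was deployed so rollbacks have a target
set -euo pipefail

SHA=$(git rev-parse --short HEAD)
TAG="deploy-$(date -u +%Y%m%d-%H%M%S)-${SHA}"

# An annotated tag marks the exact commit that shipped
git tag -a "$TAG" -m "Deployed by ${USER} at $(date -u +'%Y-%m-%dT%H:%M:%SZ')"
git push origin "$TAG"

# Keep a machine-readable record next to the running app (path is an example)
printf '{"tag":"%s","sha":"%s","deployed_at":"%s"}\n' \
  "$TAG" "$SHA" "$(date -u +'%Y-%m-%dT%H:%M:%SZ')" > /var/www/myapp/DEPLOY_INFO.json

# "Which commit was deployed last?" then has a one-line answer:
#   git tag -l 'deploy-*' | sort | tail -n 2

With even this much in place, the 2 AM rollback above becomes "check out the previous deploy tag" instead of guessing among 47 commits.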

The uncomfortable production truth: perfect infrastructure and deployment strategies can’t save you from operational disasters when your CI/CD pipeline is non-existent, monitoring is reactive instead of proactive, disaster recovery is untested, and your team spends its days debugging production issues instead of preventing them. Professional operations means thinking beyond infrastructure to the entire software delivery lifecycle.

Real-world operations failure consequences:

// What happens when operations practices are amateur:
const operationsFailureImpact = {
  deploymentDisasters: {
    problem: "Critical bug fix deployment breaks entire application",
    cause: "No CI/CD pipeline, no automated testing, manual deployments",
    impact: "6-hour outage during peak business hours, revenue loss",
    cost: "$500K in lost sales, 20% customer churn",
  },

  monitoringBlindness: {
    problem: "Performance degradation goes unnoticed for hours",
    cause: "No proactive monitoring, no alerting, reactive debugging",
    impact: "Customers experience slow site, competitors gain market share",
    consequences: "Brand reputation damaged, customer satisfaction plummets",
  },

  disasterRecoveryFailure: {
    problem: "Database corruption during peak season with no recovery plan",
    cause:
      "Untested backups, no disaster recovery procedures, single points of failure",
    impact: "Complete data loss, business operations halt for days",
    reality: "Company closes permanently, all customer data lost forever",
  },

  operationalChaos: {
    problem:
      "Teams spend 80% of time firefighting instead of building features",
    cause: "No automation, no monitoring, no proper deployment processes",
    impact: "Product development stagnates, competitors outpace innovation",
    prevention:
      "Professional operations enable teams to focus on value creation",
  },

  // Perfect infrastructure is worthless when operations
  // lack automation, monitoring, disaster recovery, and reliability practices
};

Production operations mastery requires understanding:

  • CI/CD pipelines that automate the entire software delivery lifecycle with comprehensive testing and deployment automation
  • Advanced Infrastructure as Code that manages complex environments with modules, state management, and collaborative workflows
  • Monitoring and alerting that proactively detects issues and provides actionable insights before customers are affected
  • Log aggregation and analysis that enables rapid debugging and system understanding through structured, searchable data
  • Disaster recovery and backups that ensure business continuity with tested, automated recovery procedures

This article transforms your operations from manual, reactive processes into automated, proactive systems that enable reliable, fast, and confident software delivery at enterprise scale.


CI/CD Pipelines: From Manual Deployments to Automated Excellence

The Evolution from Git Push to Production Excellence

Understanding why manual deployments kill productivity and reliability:

// Manual deployment vs Professional CI/CD pipeline comparison
const deploymentEvolution = {
  manualDeployment: {
    process: "Developer manually deploys from their laptop",
    testing: "Maybe run a few tests locally if remembered",
    consistency:
      "Every deployment is different, configuration drift guaranteed",
    rollback: "Panic-driven git revert, usually makes things worse",
    visibility: "No one knows what's deployed or when",
    scalability: "One person becomes deployment bottleneck",
    quality: "Bugs discovered by customers in production",
    reliability: "50% chance of deployment causing outage",
  },

  professionalCICD: {
    process: "Automated pipeline triggered by git events",
    testing: "Comprehensive automated test suite runs every time",
    consistency: "Identical deployment process every single time",
    rollback: "One-click automated rollback to any previous version",
    visibility: "Full deployment history and current state tracking",
    scalability: "Multiple teams deploying dozens of times per day",
    quality: "Issues caught in CI/CD before reaching production",
    reliability: "99.9% successful deployments, predictable outcomes",
  },

  theTransformationImpact: [
    "Teams deploy 10x more frequently with higher confidence",
    "Bug detection shifts left, issues caught in minutes not hours",
    "Rollbacks happen in seconds, not emergency all-hands meetings",
    "Developers focus on features, not deployment firefighting",
    "Quality improves dramatically with automated testing",
    "Infrastructure changes become routine, not risky events",
  ],
};
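
The "one-click rollback" claim above becomes concrete with release tooling that keeps a revision history. Helm, which handles the deployments later in this article, records every upgrade as a numbered revision. A short sketch against the myapp-staging release defined below (the release and namespace names come from this article; the revision number is made up):

# Helm keeps a revision history per release
helm history myapp-staging -n staging

# Roll back to the immediately previous revision...
helm rollback myapp-staging -n staging

# ...or to an explicit revision from the history output
helm rollback myapp-staging 42 -n staging

Automated pipelines wire exactly this kind of command behind a failed health check, which is what turns a rollback from an all-hands emergency into a routine step.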

GitHub Actions CI/CD Pipeline Implementation:

# .github/workflows/production-deployment.yml - Professional CI/CD pipeline
name: Production Deployment Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      environment:
        description: "Deployment environment"
        required: true
        default: "staging"
        type: choice
        options:
          - staging
          - production

env:
  NODE_VERSION: "18"
  DOCKER_REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # ========================================
  # Code Quality and Security Analysis
  # ========================================
  code-quality:
    name: Code Quality Analysis
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Needed for SonarCloud

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"

      - name: Install dependencies
        run: |
          npm ci --prefer-offline --no-audit

      - name: Run ESLint with annotations
        run: |
          npx eslint . --format=@microsoft/eslint-formatter-sarif --output-file eslint-results.sarif
          npx eslint . --format=stylish
        continue-on-error: true

      - name: Upload ESLint results to GitHub
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: eslint-results.sarif

      - name: Run Prettier check
        run: npx prettier --check .

      - name: TypeScript type checking
        run: npx tsc --noEmit

      - name: Security audit
        run: |
          npm audit --audit-level=high
          npx audit-ci --config .auditrc.json

      - name: License compliance check
        run: |
          npx license-checker --onlyAllow "MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC" --excludePrivatePackages

      - name: SonarCloud Scan
        uses: SonarSource/sonarcloud-github-action@master
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}

  # ========================================
  # Comprehensive Testing Suite
  # ========================================
  unit-tests:
    name: Unit Tests
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"

      - name: Install dependencies
        run: npm ci --prefer-offline --no-audit

      - name: Run unit tests with coverage
        run: |
          npm run test:unit -- --coverage --watchAll=false --ci

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          file: ./coverage/lcov.info
          flags: unittests
          name: codecov-umbrella

  integration-tests:
    name: Integration Tests
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_DB: testdb
          POSTGRES_USER: testuser
          POSTGRES_PASSWORD: testpass
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432

      redis:
        image: redis:7-alpine
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 6379:6379

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"

      - name: Install dependencies
        run: npm ci --prefer-offline --no-audit

      - name: Wait for services to be ready
        run: |
          timeout 60 bash -c 'until nc -z localhost 5432; do sleep 1; done'
          timeout 60 bash -c 'until nc -z localhost 6379; do sleep 1; done'

      - name: Run database migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379

      - name: Run integration tests
        run: npm run test:integration
        env:
          NODE_ENV: test
          DATABASE_URL: postgresql://testuser:testpass@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379

  e2e-tests:
    name: End-to-End Tests
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: "npm"

      - name: Install dependencies
        run: npm ci --prefer-offline --no-audit

      - name: Build application
        run: npm run build

      - name: Start application for E2E tests
        run: |
          npm start &
          timeout 60 bash -c 'until curl -f http://localhost:3000/health; do sleep 2; done'
        env:
          NODE_ENV: test
          PORT: 3000

      - name: Run Playwright E2E tests
        run: npx playwright test
        env:
          BASE_URL: http://localhost:3000

      - name: Upload Playwright report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-report
          path: playwright-report/

  # ========================================
  # Container Security and Building
  # ========================================
  container-security:
    name: Container Security Scan
    runs-on: ubuntu-latest
    needs: [code-quality]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Build Docker image for scanning
        run: |
          docker build -t security-scan:latest .

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: "security-scan:latest"
          format: "sarif"
          output: "trivy-results.sarif"

      - name: Upload Trivy scan results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: "trivy-results.sarif"

      - name: Run Hadolint Dockerfile linting
        uses: hadolint/hadolint-action@v3.1.0
        with:
          dockerfile: Dockerfile
          format: sarif
          output-file: hadolint-results.sarif

      - name: Upload Hadolint results
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: hadolint-results.sarif

  build-and-push:
    name: Build and Push Container
    runs-on: ubuntu-latest
    needs: [unit-tests, integration-tests, container-security]
    if: github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch'
    outputs:
      image-digest: ${{ steps.build.outputs.digest }}
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.DOCKER_REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64,linux/arm64
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            BUILDTIME=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.created'] }}
            VERSION=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.version'] }}
            REVISION=${{ fromJSON(steps.meta.outputs.json).labels['org.opencontainers.image.revision'] }}

  # ========================================
  # Staging Deployment and Testing
  # ========================================
  deploy-staging:
    name: Deploy to Staging
    runs-on: ubuntu-latest
    needs: [build-and-push]
    if: github.ref == 'refs/heads/main'
    environment:
      name: staging
      url: https://staging.myapp.com
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-west-2

      - name: Setup Kubernetes config
        run: |
          aws eks update-kubeconfig --name myapp-staging-cluster --region us-west-2

      - name: Deploy to staging with Helm
        run: |
          helm upgrade --install myapp-staging ./helm/myapp \
            --namespace staging \
            --create-namespace \
            --set image.repository="${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_NAME }}" \
            --set image.tag="${{ github.sha }}" \
            --set environment="staging" \
            --set ingress.host="staging.myapp.com" \
            --wait --timeout=10m

      - name: Wait for deployment to be ready
        run: |
          kubectl rollout status deployment/myapp-staging -n staging --timeout=600s

      - name: Run smoke tests against staging
        run: |
          timeout 300 bash -c 'until curl -f https://staging.myapp.com/health; do sleep 10; done'
          npm run test:smoke -- --baseURL=https://staging.myapp.com

      - name: Run load tests against staging
        run: |
          npm run test:load -- --baseURL=https://staging.myapp.com

  # ========================================
  # Production Deployment with Approval
  # ========================================
  deploy-production:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: [deploy-staging, e2e-tests]
    if: github.ref == 'refs/heads/main' || (github.event_name == 'workflow_dispatch' && github.event.inputs.environment == 'production')
    environment:
      name: production
      url: https://myapp.com
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_PROD_ROLE_ARN }}
          aws-region: us-west-2

      - name: Setup Kubernetes config
        run: |
          aws eks update-kubeconfig --name myapp-production-cluster --region us-west-2

      - name: Pre-deployment health check
        run: |
          kubectl get nodes
          kubectl get pods -A | grep -E "(Crash|Error|ImagePull)" && exit 1 || true

      - name: Deploy to production with blue-green strategy
        run: |
          # Deploy to inactive environment (green if blue is active)
          ACTIVE_ENV=$(kubectl get service myapp-active -n production -o jsonpath='{.spec.selector.version}' 2>/dev/null || echo "blue")
          TARGET_ENV=$([ "$ACTIVE_ENV" = "blue" ] && echo "green" || echo "blue")

          echo "Deploying to $TARGET_ENV environment (current active: $ACTIVE_ENV)"

          helm upgrade --install myapp-$TARGET_ENV ./helm/myapp \
            --namespace production \
            --create-namespace \
            --set image.repository="${{ env.DOCKER_REGISTRY }}/${{ env.IMAGE_NAME }}" \
            --set image.tag="${{ github.sha }}" \
            --set environment="production" \
            --set deployment.version="$TARGET_ENV" \
            --set ingress.host="myapp.com" \
            --wait --timeout=15m

      - name: Validate new deployment
        run: |
          ACTIVE_ENV=$(kubectl get service myapp-active -n production -o jsonpath='{.spec.selector.version}' 2>/dev/null || echo "blue")
          TARGET_ENV=$([ "$ACTIVE_ENV" = "blue" ] && echo "green" || echo "blue")

          # Wait for deployment to be fully ready
          kubectl rollout status deployment/myapp-$TARGET_ENV -n production --timeout=600s

          # Run comprehensive health checks
          kubectl port-forward svc/myapp-$TARGET_ENV 8080:80 -n production &
          PF_PID=$!
          sleep 10

          # Smoke tests
          curl -f http://localhost:8080/health || exit 1
          npm run test:smoke -- --baseURL=http://localhost:8080

          kill $PF_PID

      - name: Switch traffic to new deployment
        run: |
          ACTIVE_ENV=$(kubectl get service myapp-active -n production -o jsonpath='{.spec.selector.version}' 2>/dev/null || echo "blue")
          TARGET_ENV=$([ "$ACTIVE_ENV" = "blue" ] && echo "green" || echo "blue")

          echo "Switching traffic to $TARGET_ENV"
          kubectl patch service myapp-active -n production -p '{"spec":{"selector":{"version":"'$TARGET_ENV'"}}}'

          # Update labels to mark new environment as active
          kubectl label deployment myapp-$TARGET_ENV -n production version=active --overwrite

          # Mark old environment as inactive
          OLD_ENV=$([ "$TARGET_ENV" = "blue" ] && echo "green" || echo "blue")
          kubectl label deployment myapp-$OLD_ENV -n production version=inactive --overwrite

      - name: Post-deployment monitoring
        run: |
          echo "Monitoring deployment for 5 minutes..."
          for i in {1..30}; do
            if ! curl -f https://myapp.com/health; then
              echo "Health check failed, initiating rollback"
              # Rollback logic would go here
              exit 1
            fi
            sleep 10
          done
          echo "Deployment stable, monitoring successful"

      - name: Cleanup old deployment
        run: |
          # The service now points at the new color; the other color is the old release
          ACTIVE_ENV=$(kubectl get service myapp-active -n production -o jsonpath='{.spec.selector.version}')
          OLD_ENV=$([ "$ACTIVE_ENV" = "blue" ] && echo "green" || echo "blue")
          if [ -n "$OLD_ENV" ]; then
            echo "Cleaning up old deployment: $OLD_ENV"
            helm uninstall myapp-$OLD_ENV --namespace production || true
          fi

  # ========================================
  # Post-deployment notifications
  # ========================================
  notify-deployment:
    name: Notify Deployment Status
    runs-on: ubuntu-latest
    needs: [deploy-production]
    if: always()
    steps:
      - name: Notify Slack on success
        if: needs.deploy-production.result == 'success'
        uses: 8398a7/action-slack@v3
        with:
          status: success
          channel: "#deployments"
          text: |
            🎉 Production deployment successful!

            **Repository:** ${{ github.repository }}
            **Branch:** ${{ github.ref_name }}
            **Commit:** ${{ github.sha }}
            **Author:** ${{ github.actor }}

            **Deployed to:** https://myapp.com
            **Dashboard:** https://grafana.myapp.com
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

      - name: Notify Slack on failure
        if: needs.deploy-production.result == 'failure'
        uses: 8398a7/action-slack@v3
        with:
          status: failure
          channel: "#alerts"
          text: |
            🚨 Production deployment failed!

            **Repository:** ${{ github.repository }}
            **Branch:** ${{ github.ref_name }}
            **Commit:** ${{ github.sha }}
            **Author:** ${{ github.actor }}

            **Action:** https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}

            Please investigate immediately.
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}

      - name: Create GitHub release on successful production deploy
        if: needs.deploy-production.result == 'success' && github.ref == 'refs/heads/main'
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: production-${{ github.run_number }}
          release_name: Production Release ${{ github.run_number }}
          body: |
            Automated production release

            **Commit:** ${{ github.sha }}
            **Deployed:** ${{ github.event.head_commit.timestamp }}

            [View deployment workflow](https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }})
          draft: false
          prerelease: false
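
The post-deployment monitoring step above leaves the rollback itself as a placeholder. Because the blue-green switch is only a change to the active service’s selector, rolling back can be the same patch in reverse. A hedged sketch using the same myapp-active service, version labels, and production namespace as the pipeline (tune the health check URL to your own endpoints):

#!/usr/bin/env bash
# rollback-blue-green.sh - point the active service back at the previous color
set -euo pipefail

NAMESPACE=production
CURRENT=$(kubectl get service myapp-active -n "$NAMESPACE" \
  -o jsonpath='{.spec.selector.version}')
PREVIOUS=$([ "$CURRENT" = "blue" ] && echo "green" || echo "blue")

echo "Rolling back: switching traffic from $CURRENT to $PREVIOUS"

# Only safe while the previous release still exists (i.e. before cleanup runs)
kubectl get deployment "myapp-$PREVIOUS" -n "$NAMESPACE" >/dev/null

kubectl patch service myapp-active -n "$NAMESPACE" \
  -p '{"spec":{"selector":{"version":"'"$PREVIOUS"'"}}}'

# Confirm the previous version is serving traffic again
sleep 5
curl -fsS https://myapp.com/health

This is also why the cleanup step is best delayed until the new version has been stable for an agreed window: once the old release is uninstalled, "rollback" means a full redeploy rather than a selector flip.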

GitLab CI/CD Pipeline Implementation:

# .gitlab-ci.yml - Professional GitLab CI/CD pipeline
stages:
  - quality
  - test
  - security
  - build
  - deploy-staging
  - deploy-production
  - monitor

variables:
  DOCKER_REGISTRY: $CI_REGISTRY
  IMAGE_NAME: $CI_REGISTRY_IMAGE
  KUBERNETES_VERSION: "1.28.0" # full version required for the dl.k8s.io download URL

# ========================================
# Quality and Security Analysis
# ========================================
code-quality:
  stage: quality
  image: node:18-alpine
  cache:
    paths:
      - node_modules/
  script:
    - npm ci --prefer-offline
    - npm run lint -- --format=junit --output-file=eslint-report.xml
    - npm run prettier:check
    - npx tsc --noEmit
  artifacts:
    reports:
      junit: eslint-report.xml
    paths:
      - eslint-report.xml
    expire_in: 1 week

dependency-security:
  stage: quality
  image: node:18-alpine
  script:
    - npm ci --prefer-offline
    - npm audit --audit-level=high
    - npx audit-ci --config .auditrc.json
    - npx license-checker --onlyAllow "MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC"
  allow_failure: false

sonarcloud-check:
  stage: quality
  image: sonarsource/sonar-scanner-cli:latest
  variables:
    SONAR_USER_HOME: "${CI_PROJECT_DIR}/.sonar"
    GIT_DEPTH: "0"
  cache:
    key: "${CI_JOB_NAME}"
    paths:
      - .sonar/cache
  script:
    - sonar-scanner
  only:
    - main
    - merge_requests

# ========================================
# Comprehensive Testing
# ========================================
unit-tests:
  stage: test
  image: node:18-alpine
  services:
    - postgres:15
    - redis:7-alpine
  variables:
    POSTGRES_DB: testdb
    POSTGRES_USER: testuser
    POSTGRES_PASSWORD: testpass
    REDIS_URL: redis://redis:6379
    DATABASE_URL: postgresql://testuser:testpass@postgres:5432/testdb
  cache:
    paths:
      - node_modules/
  before_script:
    - npm ci --prefer-offline
    - npm run db:migrate
  script:
    - npm run test:unit -- --coverage --ci --watchAll=false
    - npm run test:integration
  coverage: '/All files[^|]*\|[^|]*\s+([\d\.]+)/'
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml
      junit: junit.xml
    paths:
      - coverage/
    expire_in: 1 week

e2e-tests:
  stage: test
  image: mcr.microsoft.com/playwright:v1.40.0-focal
  services:
    - postgres:15
    - redis:7-alpine
  variables:
    POSTGRES_DB: testdb
    POSTGRES_USER: testuser
    POSTGRES_PASSWORD: testpass
    DATABASE_URL: postgresql://testuser:testpass@postgres:5432/testdb
    REDIS_URL: redis://redis:6379
  before_script:
    - npm ci --prefer-offline
    - npm run build
    - npm run db:migrate
    - npm start &
    - sleep 30
    - curl -f http://localhost:3000/health
  script:
    - npx playwright test
  artifacts:
    when: always
    paths:
      - playwright-report/
      - test-results/
    expire_in: 1 week

# ========================================
# Security and Container Scanning
# ========================================
container-security:
  stage: security
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
    TRIVY_CACHE_DIR: ".trivycache/"
  before_script:
    - docker build -t $IMAGE_NAME:$CI_COMMIT_SHA .
    - apk add --no-cache curl
    - curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
  script:
    - trivy image --exit-code 1 --severity HIGH,CRITICAL --no-progress $IMAGE_NAME:$CI_COMMIT_SHA
    - trivy image --format template --template "@contrib/gitlab.tpl" --output gl-container-scanning-report.json $IMAGE_NAME:$CI_COMMIT_SHA
  cache:
    paths:
      - .trivycache/
  artifacts:
    reports:
      container_scanning: gl-container-scanning-report.json
    expire_in: 1 week

dockerfile-lint:
  stage: security
  image: hadolint/hadolint:latest-alpine
  script:
    - hadolint --format gitlab_codeclimate --failure-threshold warning Dockerfile > hadolint-report.json
  artifacts:
    reports:
      codequality: hadolint-report.json
    expire_in: 1 week

# ========================================
# Build and Registry
# ========================================
build-image:
  stage: build
  image: docker:20.10.16
  services:
    - docker:20.10.16-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  script:
    - |
      # Multi-arch build with buildx
      docker buildx create --use
      docker buildx build \
        --platform linux/amd64,linux/arm64 \
        --tag $IMAGE_NAME:$CI_COMMIT_SHA \
        --tag $IMAGE_NAME:latest \
        --push \
        --build-arg BUILDTIME=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
        --build-arg VERSION=$CI_COMMIT_TAG \
        --build-arg REVISION=$CI_COMMIT_SHA \
        .
  only:
    - main
    - tags

# ========================================
# Staging Deployment
# ========================================
deploy-staging:
  stage: deploy-staging
  image:
    name: alpine/helm:3.12.0
    entrypoint: [""]
  environment:
    name: staging
    url: https://staging.myapp.com
  before_script:
    - apk add --no-cache curl bash
    - curl -LO "https://dl.k8s.io/release/v$KUBERNETES_VERSION/bin/linux/amd64/kubectl"
    - install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
    - mkdir -p ~/.kube && echo $KUBE_CONFIG_STAGING | base64 -d > ~/.kube/config
  script:
    - |
      helm upgrade --install myapp-staging ./helm/myapp \
        --namespace staging \
        --create-namespace \
        --set image.repository=$IMAGE_NAME \
        --set image.tag=$CI_COMMIT_SHA \
        --set environment=staging \
        --set ingress.host=staging.myapp.com \
        --wait --timeout=10m
        
      kubectl rollout status deployment/myapp-staging -n staging --timeout=600s

      # Smoke tests
      sleep 30
      curl -f https://staging.myapp.com/health
  only:
    - main

staging-smoke-tests:
  stage: deploy-staging
  image: node:18-alpine
  needs:
    - deploy-staging
  script:
    - npm ci --prefer-offline
    - npm run test:smoke -- --baseURL=https://staging.myapp.com
    - npm run test:load -- --baseURL=https://staging.myapp.com --duration=5m
  only:
    - main

# ========================================
# Production Deployment
# ========================================
deploy-production:
  stage: deploy-production
  image:
    name: alpine/helm:3.12.0
    entrypoint: [""]
  environment:
    name: production
    url: https://myapp.com
  before_script:
    - apk add --no-cache curl bash jq
    - curl -LO "https://dl.k8s.io/release/v$KUBERNETES_VERSION/bin/linux/amd64/kubectl"
    - install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
    - mkdir -p ~/.kube && echo $KUBE_CONFIG_PRODUCTION | base64 -d > ~/.kube/config
  script:
    - |
      # Blue-green deployment strategy
      ACTIVE_ENV=$(kubectl get service myapp-active -n production -o jsonpath='{.spec.selector.version}' 2>/dev/null || echo "blue")
      TARGET_ENV=$([ "$ACTIVE_ENV" = "blue" ] && echo "green" || echo "blue")

      echo "Deploying to $TARGET_ENV (current active: $ACTIVE_ENV)"

      # Deploy to target environment
      helm upgrade --install myapp-$TARGET_ENV ./helm/myapp \
        --namespace production \
        --create-namespace \
        --set image.repository=$IMAGE_NAME \
        --set image.tag=$CI_COMMIT_SHA \
        --set environment=production \
        --set deployment.version=$TARGET_ENV \
        --set ingress.host=myapp.com \
        --wait --timeout=15m

      kubectl rollout status deployment/myapp-$TARGET_ENV -n production --timeout=600s

      # Validation tests
      kubectl port-forward svc/myapp-$TARGET_ENV 8080:80 -n production &
      PF_PID=$!
      sleep 15

      curl -f http://localhost:8080/health || exit 1
      kill $PF_PID

      # Switch traffic
      kubectl patch service myapp-active -n production -p '{"spec":{"selector":{"version":"'$TARGET_ENV'"}}}'
      kubectl label deployment myapp-$TARGET_ENV -n production version=active --overwrite

      # Monitor for 5 minutes (seq, since the alpine image's sh lacks brace expansion)
      for i in $(seq 1 30); do
        curl -f https://myapp.com/health || exit 1
        sleep 10
      done

      # Cleanup old deployment
      OLD_ENV=$([ "$TARGET_ENV" = "blue" ] && echo "green" || echo "blue")
      helm uninstall myapp-$OLD_ENV --namespace production || true

  when: manual
  only:
    - main

# ========================================
# Monitoring and Notifications
# ========================================
post-deploy-monitoring:
  stage: monitor
  image: curlimages/curl:latest
  script:
    - |
      # Send deployment notification
      curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"✅ Production deployment successful!\n**Project:** $CI_PROJECT_NAME\n**Commit:** $CI_COMMIT_SHA\n**Pipeline:** $CI_PIPELINE_URL\"}" \
        $SLACK_WEBHOOK_URL

      # Record a deployment annotation in Grafana so dashboards show when releases happened
      curl -X POST -H "Authorization: Bearer $GRAFANA_API_KEY" \
        -H "Content-Type: application/json" \
        -d "{\"text\": \"Production deployment $CI_COMMIT_SHA\", \"tags\": [\"deployment\", \"$CI_PROJECT_NAME\"]}" \
        $GRAFANA_URL/api/annotations
  dependencies:
    - deploy-production
  when: on_success
  only:
    - main

Advanced Infrastructure as Code: Modules, State, and Collaboration

Beyond Basic Terraform: Enterprise Infrastructure Management

Understanding advanced Infrastructure as Code patterns:

# terraform/modules/networking/main.tf - Reusable networking module
variable "project_name" {
  description = "Name of the project"
  type        = string
}

variable "environment" {
  description = "Environment (dev, staging, production)"
  type        = string
}

variable "region" {
  description = "AWS region"
  type        = string
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
}

variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "enable_nat_gateway" {
  description = "Enable NAT Gateway for private subnets"
  type        = bool
  default     = true
}

variable "enable_vpn_gateway" {
  description = "Enable VPN Gateway"
  type        = bool
  default     = false
}

locals {
  common_tags = {
    Project     = var.project_name
    Environment = var.environment
    Module      = "networking"
    ManagedBy   = "terraform"
  }

  # Calculate subnet CIDRs automatically
  public_subnet_cidrs  = [for i in range(length(var.availability_zones)) : cidrsubnet(var.vpc_cidr, 8, i)]
  private_subnet_cidrs = [for i in range(length(var.availability_zones)) : cidrsubnet(var.vpc_cidr, 8, i + 10)]
  database_subnet_cidrs = [for i in range(length(var.availability_zones)) : cidrsubnet(var.vpc_cidr, 8, i + 20)]
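  # With the default 10.0.0.0/16, cidrsubnet(cidr, 8, netnum) adds 8 bits (/16 -> /24)
  # and selects the netnum-th /24, so the three tiers land in distinct ranges:
  #   public   -> 10.0.0.0/24,  10.0.1.0/24,  10.0.2.0/24,  ...
  #   private  -> 10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24, ...
  #   database -> 10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24, ...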
}

# VPC
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-vpc"
  })
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-igw"
  })
}

# Public Subnets
resource "aws_subnet" "public" {
  count = length(var.availability_zones)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = local.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-public-${count.index + 1}"
    Type = "public"
    "kubernetes.io/role/elb" = "1"
  })
}

# Private Subnets
resource "aws_subnet" "private" {
  count = length(var.availability_zones)

  vpc_id            = aws_vpc.main.id
  cidr_block        = local.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-private-${count.index + 1}"
    Type = "private"
    "kubernetes.io/role/internal-elb" = "1"
  })
}

# Database Subnets
resource "aws_subnet" "database" {
  count = length(var.availability_zones)

  vpc_id            = aws_vpc.main.id
  cidr_block        = local.database_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-database-${count.index + 1}"
    Type = "database"
  })
}

# NAT Gateways (conditional)
resource "aws_eip" "nat" {
  count = var.enable_nat_gateway ? length(var.availability_zones) : 0

  domain = "vpc"

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-nat-eip-${count.index + 1}"
  })
}

resource "aws_nat_gateway" "main" {
  count = var.enable_nat_gateway ? length(var.availability_zones) : 0

  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-nat-${count.index + 1}"
  })

  depends_on = [aws_internet_gateway.main]
}

# Route Tables
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-public-rt"
  })
}

resource "aws_route_table" "private" {
  count = length(var.availability_zones)

  vpc_id = aws_vpc.main.id

  dynamic "route" {
    for_each = var.enable_nat_gateway ? [1] : []
    content {
      cidr_block     = "0.0.0.0/0"
      nat_gateway_id = aws_nat_gateway.main[count.index].id
    }
  }

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-private-rt-${count.index + 1}"
  })
}

resource "aws_route_table" "database" {
  vpc_id = aws_vpc.main.id

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-database-rt"
  })
}

# Route Table Associations
resource "aws_route_table_association" "public" {
  count = length(var.availability_zones)

  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count = length(var.availability_zones)

  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

resource "aws_route_table_association" "database" {
  count = length(var.availability_zones)

  subnet_id      = aws_subnet.database[count.index].id
  route_table_id = aws_route_table.database.id
}

# Database Subnet Group
resource "aws_db_subnet_group" "main" {
  name       = "${var.project_name}-${var.environment}-db-subnet-group"
  subnet_ids = aws_subnet.database[*].id

  tags = merge(local.common_tags, {
    Name = "${var.project_name}-${var.environment}-db-subnet-group"
  })
}

# VPC Flow Logs
resource "aws_flow_log" "main" {
  iam_role_arn    = aws_iam_role.flow_log.arn
  log_destination = aws_cloudwatch_log_group.vpc_flow_log.arn
  traffic_type    = "ALL"
  vpc_id          = aws_vpc.main.id
}

resource "aws_cloudwatch_log_group" "vpc_flow_log" {
  name              = "/aws/vpc/flowlogs/${var.project_name}-${var.environment}"
  retention_in_days = 30

  tags = local.common_tags
}

resource "aws_iam_role" "flow_log" {
  name = "${var.project_name}-${var.environment}-flow-log-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "vpc-flow-logs.amazonaws.com"
        }
      }
    ]
  })

  tags = local.common_tags
}

resource "aws_iam_role_policy" "flow_log" {
  name = "${var.project_name}-${var.environment}-flow-log-policy"
  role = aws_iam_role.flow_log.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "logs:DescribeLogGroups",
          "logs:DescribeLogStreams"
        ]
        Resource = "*"
      }
    ]
  })
}

# Outputs
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "vpc_cidr_block" {
  description = "CIDR block of the VPC"
  value       = aws_vpc.main.cidr_block
}

output "public_subnet_ids" {
  description = "IDs of the public subnets"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = aws_subnet.private[*].id
}

output "database_subnet_ids" {
  description = "IDs of the database subnets"
  value       = aws_subnet.database[*].id
}

output "database_subnet_group_name" {
  description = "Name of the database subnet group"
  value       = aws_db_subnet_group.main.name
}

output "internet_gateway_id" {
  description = "ID of the Internet Gateway"
  value       = aws_internet_gateway.main.id
}

output "nat_gateway_ids" {
  description = "IDs of the NAT Gateways"
  value       = aws_nat_gateway.main[*].id
}

Advanced Terraform State Management:

# terraform/environments/production/backend.tf - Remote state configuration
terraform {
  backend "s3" {
    bucket         = "myapp-terraform-state-production"
    key            = "infrastructure/production/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks-production"

    # State locking and consistency checking
    skip_credentials_validation = false
    skip_metadata_api_check     = false
    skip_region_validation      = false
    force_path_style           = false
  }

  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.11"
    }
  }
}

# terraform/environments/production/main.tf - Environment-specific configuration
locals {
  environment = "production"
  region      = "us-west-2"

  # Environment-specific configurations
  cluster_config = {
    version = "1.28"
    node_groups = {
      main = {
        instance_types = ["m5.large", "m5.xlarge"]
        capacity_type  = "ON_DEMAND"
        min_size       = 3
        max_size       = 20
        desired_size   = 5
      }
      spot = {
        instance_types = ["m5.large", "m5.xlarge", "m5a.large", "m5a.xlarge"]
        capacity_type  = "SPOT"
        min_size       = 2
        max_size       = 15
        desired_size   = 3
      }
    }
  }

  database_config = {
    engine_version = "15.4"
    instance_class = "db.r5.xlarge"
    multi_az       = true
    backup_retention = 30
    storage_size   = 500
    max_storage    = 2000
  }

  redis_config = {
    node_type         = "cache.r6g.large"
    num_cache_clusters = 3
    automatic_failover = true
    multi_az          = true
  }
}

data "aws_availability_zones" "available" {
  state = "available"
}

# Networking Module
module "networking" {
  source = "../../modules/networking"

  project_name       = var.project_name
  environment        = local.environment
  region            = local.region
  availability_zones = slice(data.aws_availability_zones.available.names, 0, 3)
  vpc_cidr          = "10.0.0.0/16"
  enable_nat_gateway = true
  enable_vpn_gateway = false
}

# Security Module
module "security" {
  source = "../../modules/security"

  project_name = var.project_name
  environment  = local.environment
  vpc_id       = module.networking.vpc_id
  vpc_cidr     = module.networking.vpc_cidr_block
}

# EKS Cluster Module
module "eks" {
  source = "../../modules/eks"

  project_name          = var.project_name
  environment          = local.environment
  region               = local.region
  kubernetes_version   = local.cluster_config.version

  vpc_id              = module.networking.vpc_id
  subnet_ids          = module.networking.private_subnet_ids
  security_group_ids  = [module.security.cluster_security_group_id]

  node_groups = local.cluster_config.node_groups

  # Add-ons
  enable_cluster_autoscaler = true
  enable_aws_load_balancer_controller = true
  enable_external_dns = true
  enable_cert_manager = true

  tags = {
    Environment = local.environment
    Terraform   = "true"
  }
}

# Database Module
module "database" {
  source = "../../modules/database"

  project_name   = var.project_name
  environment    = local.environment

  vpc_id                = module.networking.vpc_id
  subnet_group_name     = module.networking.database_subnet_group_name
  security_group_ids    = [module.security.database_security_group_id]

  engine_version       = local.database_config.engine_version
  instance_class       = local.database_config.instance_class
  allocated_storage    = local.database_config.storage_size
  max_allocated_storage = local.database_config.max_storage

  multi_az = local.database_config.multi_az
  backup_retention_period = local.database_config.backup_retention

  # Performance Insights
  performance_insights_enabled = true
  performance_insights_retention_period = 7

  # Enhanced Monitoring
  monitoring_interval = 60
  monitoring_role_arn = aws_iam_role.rds_enhanced_monitoring.arn

  tags = {
    Environment = local.environment
    Terraform   = "true"
  }
}

# Enhanced Monitoring IAM Role for RDS
resource "aws_iam_role" "rds_enhanced_monitoring" {
  name = "${var.project_name}-${local.environment}-rds-monitoring-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "monitoring.rds.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "rds_enhanced_monitoring" {
  role       = aws_iam_role.rds_enhanced_monitoring.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonRDSEnhancedMonitoringRole"
}

# Redis Module
module "redis" {
  source = "../../modules/redis"

  project_name = var.project_name
  environment  = local.environment

  vpc_id             = module.networking.vpc_id
  subnet_ids         = module.networking.private_subnet_ids
  security_group_ids = [module.security.redis_security_group_id]

  node_type               = local.redis_config.node_type
  num_cache_clusters      = local.redis_config.num_cache_clusters
  automatic_failover_enabled = local.redis_config.automatic_failover
  multi_az_enabled        = local.redis_config.multi_az

  # Security
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token_enabled        = true

  # Backup
  snapshot_retention_limit = 7
  snapshot_window         = "03:00-05:00"

  tags = {
    Environment = local.environment
    Terraform   = "true"
  }
}

# Monitoring Module
module "monitoring" {
  source = "../../modules/monitoring"

  project_name = var.project_name
  environment  = local.environment

  cluster_name = module.eks.cluster_name
  vpc_id       = module.networking.vpc_id

  # SNS Topics for alerts
  create_sns_topics = true
  alert_email = var.alert_email

  # CloudWatch Dashboard
  create_dashboard = true

  # Log retention
  log_retention_days = 30

  tags = {
    Environment = local.environment
    Terraform   = "true"
  }
}

# Outputs
output "cluster_endpoint" {
  description = "EKS cluster endpoint"
  value       = module.eks.cluster_endpoint
  sensitive   = true
}

output "cluster_name" {
  description = "EKS cluster name"
  value       = module.eks.cluster_name
}

output "database_endpoint" {
  description = "RDS database endpoint"
  value       = module.database.endpoint
  sensitive   = true
}

output "redis_endpoint" {
  description = "Redis cluster endpoint"
  value       = module.redis.configuration_endpoint
  sensitive   = true
}

output "vpc_id" {
  description = "VPC ID"
  value       = module.networking.vpc_id
}
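
How this environment is typically driven from a shell, as a minimal sketch (it assumes AWS credentials are already configured and the backend bucket and lock table exist; the fuller automation lives in the management script below):

# Run from the environment directory so backend.tf and terraform.tfvars are picked up
cd terraform/environments/production

# init connects to the S3 backend and DynamoDB lock table declared in backend.tf
terraform init

# Plan and apply; state and locking are handled remotely, so teammates see the same state
terraform plan  -var="project_name=myapp" -var-file="terraform.tfvars"
terraform apply -var="project_name=myapp" -var-file="terraform.tfvars"

# Outputs feed other tooling, e.g. pointing kubectl at the new cluster
aws eks update-kubeconfig \
  --name "$(terraform output -raw cluster_name)" \
  --region us-west-2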

Terraform Workflow Automation:

#!/bin/bash
# terraform-management.sh - Professional Terraform workflow automation

set -euo pipefail

PROJECT_NAME="${PROJECT_NAME:-myapp}"
TERRAFORM_VERSION="${TERRAFORM_VERSION:-1.5.7}"
AWS_REGION="${AWS_REGION:-us-west-2}"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

log() {
    echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] $1${NC}"
}

warn() {
    echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] WARNING: $1${NC}"
}

error() {
    echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $1${NC}"
    exit 1
}

check_prerequisites() {
    log "Checking prerequisites..."

    # Check Terraform installation
    if ! command -v terraform &> /dev/null; then
        error "Terraform is not installed"
    fi

    local tf_version=$(terraform version -json | jq -r '.terraform_version')
    if [[ "$tf_version" != "$TERRAFORM_VERSION" ]]; then
        warn "Expected Terraform $TERRAFORM_VERSION, found $tf_version"
    fi

    # Check AWS CLI
    if ! command -v aws &> /dev/null; then
        error "AWS CLI is not installed"
    fi

    # Verify AWS credentials
    if ! aws sts get-caller-identity &> /dev/null; then
        error "AWS credentials not configured or invalid"
    fi

    log "Prerequisites check passed"
}

setup_terraform_backend() {
    local environment="$1"

    log "Setting up Terraform backend for $environment..."

    # Create S3 bucket for state if it doesn't exist
    local bucket_name="${PROJECT_NAME}-terraform-state-${environment}"
    if ! aws s3 ls "s3://$bucket_name" &> /dev/null; then
        log "Creating S3 bucket: $bucket_name"
        aws s3 mb "s3://$bucket_name" --region "$AWS_REGION"

        # Enable versioning
        aws s3api put-bucket-versioning \
            --bucket "$bucket_name" \
            --versioning-configuration Status=Enabled

        # Enable server-side encryption
        aws s3api put-bucket-encryption \
            --bucket "$bucket_name" \
            --server-side-encryption-configuration '{
                "Rules": [{
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "AES256"
                    }
                }]
            }'

        # Block public access
        aws s3api put-public-access-block \
            --bucket "$bucket_name" \
            --public-access-block-configuration \
                BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
    fi

    # Create DynamoDB table for locking if it doesn't exist
    local table_name="terraform-locks-${environment}"
    if ! aws dynamodb describe-table --table-name "$table_name" &> /dev/null; then
        log "Creating DynamoDB table: $table_name"
        aws dynamodb create-table \
            --table-name "$table_name" \
            --attribute-definitions AttributeName=LockID,AttributeType=S \
            --key-schema AttributeName=LockID,KeyType=HASH \
            --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5 \
            --region "$AWS_REGION"

        # Wait for table to be active
        aws dynamodb wait table-exists --table-name "$table_name"
    fi

    log "Terraform backend setup completed for $environment"
}

terraform_plan() {
    local environment="$1"
    local workspace_dir="terraform/environments/$environment"

    if [[ ! -d "$workspace_dir" ]]; then
        error "Environment directory not found: $workspace_dir"
    fi

    cd "$workspace_dir"

    log "Planning Terraform changes for $environment..."

    # Initialize if needed
    terraform init -upgrade

    # Validate configuration
    terraform validate

    # Format check
    if ! terraform fmt -check -recursive; then
        warn "Terraform files are not properly formatted. Run 'terraform fmt -recursive' to fix."
    fi

    # Security scan with tfsec
    if command -v tfsec &> /dev/null; then
        log "Running security scan with tfsec..."
        tfsec . --format=junit --out=tfsec-report.xml || warn "Security issues found"
    fi

    # Cost estimation with Infracost
    if command -v infracost &> /dev/null && [[ -n "${INFRACOST_API_KEY:-}" ]]; then
        log "Generating cost estimate..."
        infracost breakdown --path . --format json --out-file infracost.json
        infracost diff --path . --format table
    fi

    # Plan with -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending.
    # Capture the exit code explicitly so 'set -e' does not abort the script on exit code 2.
    local plan_exit_code=0
    terraform plan \
        -detailed-exitcode \
        -out="tfplan-$(date +%Y%m%d-%H%M%S).plan" \
        -var="project_name=$PROJECT_NAME" \
        -var-file="terraform.tfvars" || plan_exit_code=$?

    if [[ $plan_exit_code -eq 1 ]]; then
        error "Terraform plan failed"
    elif [[ $plan_exit_code -eq 2 ]]; then
        log "Terraform plan completed with changes"
    else
        log "Terraform plan completed - no changes"
    fi

    cd - > /dev/null
}

terraform_apply() {
    local environment="$1"
    local workspace_dir="terraform/environments/$environment"
    local auto_approve="${2:-false}"

    if [[ ! -d "$workspace_dir" ]]; then
        error "Environment directory not found: $workspace_dir"
    fi

    cd "$workspace_dir"

    log "Applying Terraform changes for $environment..."

    # Find the latest plan file
    local plan_file=$(ls -t tfplan-*.plan 2>/dev/null | head -n1 || echo "")

    if [[ -z "$plan_file" ]]; then
        warn "No plan file found. Running plan first..."
        terraform plan -var="project_name=$PROJECT_NAME" -var-file="terraform.tfvars"
    fi

    # Apply changes
    local apply_args=(-var="project_name=$PROJECT_NAME" -var-file="terraform.tfvars")

    if [[ "$auto_approve" == "true" ]]; then
        apply_args+=(-auto-approve)
    fi

    if [[ -n "$plan_file" ]]; then
        terraform apply "${apply_args[@]}" "$plan_file"
    else
        terraform apply "${apply_args[@]}"
    fi

    # Clean up old plan files
    find . -name "tfplan-*.plan" -mtime +7 -delete

    log "Terraform apply completed for $environment"

    # Output important values
    log "Retrieving outputs..."
    terraform output -json > "outputs-$(date +%Y%m%d-%H%M%S).json"

    cd - > /dev/null
}

terraform_destroy() {
    local environment="$1"
    local workspace_dir="terraform/environments/$environment"

    if [[ ! -d "$workspace_dir" ]]; then
        error "Environment directory not found: $workspace_dir"
    fi

    if [[ "$environment" == "production" ]]; then
        error "Destruction of production environment requires manual confirmation"
    fi

    cd "$workspace_dir"

    warn "This will DESTROY all resources in $environment environment!"
    read -p "Are you absolutely sure? Type 'yes' to confirm: " confirmation

    if [[ "$confirmation" != "yes" ]]; then
        log "Destruction cancelled"
        cd - > /dev/null
        return
    fi

    log "Destroying Terraform resources for $environment..."

    terraform destroy \
        -var="project_name=$PROJECT_NAME" \
        -var-file="terraform.tfvars"

    log "Terraform destroy completed for $environment"

    cd - > /dev/null
}

state_management() {
    local action="$1"
    local environment="$2"
    local workspace_dir="terraform/environments/$environment"

    if [[ ! -d "$workspace_dir" ]]; then
        error "Environment directory not found: $workspace_dir"
    fi

    cd "$workspace_dir"

    case "$action" in
        "list")
            log "Listing Terraform state resources for $environment..."
            terraform state list
            ;;
        "show")
            local resource="$3"
            if [[ -z "$resource" ]]; then
                error "Resource name required for show command"
            fi
            terraform state show "$resource"
            ;;
        "pull")
            log "Pulling remote state for $environment..."
            terraform state pull > "state-backup-$(date +%Y%m%d-%H%M%S).json"
            log "State backed up locally"
            ;;
        "refresh")
            log "Refreshing Terraform state for $environment..."
            terraform refresh -var="project_name=$PROJECT_NAME" -var-file="terraform.tfvars"
            ;;
        *)
            error "Unknown state action: $action"
            ;;
    esac

    cd - > /dev/null
}

# Main command router
case "${1:-help}" in
    "init")
        check_prerequisites
        setup_terraform_backend "${2:-staging}"
        ;;
    "plan")
        check_prerequisites
        terraform_plan "${2:-staging}"
        ;;
    "apply")
        check_prerequisites
        terraform_apply "${2:-staging}" "${3:-false}"
        ;;
    "destroy")
        check_prerequisites
        terraform_destroy "${2:-staging}"
        ;;
    "state")
        check_prerequisites
        state_management "${2:-list}" "${3:-staging}" "${4:-}"
        ;;
    "help"|*)
        cat << EOF
Terraform Infrastructure Management

Usage: $0 <command> [options]

Commands:
    init <environment>              Initialize Terraform backend
    plan <environment>              Plan infrastructure changes
    apply <environment> [auto]      Apply infrastructure changes
    destroy <environment>           Destroy infrastructure (staging only)
    state <action> <environment>    Manage Terraform state

State Actions:
    list                           List all resources
    show <resource>                Show resource details
    pull                           Backup state locally
    refresh                        Refresh state from real infrastructure

Environments:
    staging                        Staging environment
    production                     Production environment

Examples:
    $0 init staging
    $0 plan production
    $0 apply staging auto
    $0 state list production
    $0 state show production aws_vpc.main
EOF
        ;;
esac
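
Putting the wrapper to work, a typical workflow treats plan and apply as separate, reviewable steps; the script name and arguments below are illustrative assumptions based on the command router above:

# Illustrative workflow for the wrapper above (script name is an assumption)
./terraform-manage.sh init staging            # one-time backend and workspace setup
./terraform-manage.sh plan staging            # writes a timestamped tfplan-*.plan for review
./terraform-manage.sh apply staging true      # CI applies the saved plan non-interactively
./terraform-manage.sh state pull staging      # snapshot remote state before risky changes

# Production follows the same flow without auto-approve, so a human reviews the plan first
./terraform-manage.sh plan production
./terraform-manage.sh apply production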

Monitoring and Alerting: Proactive Operations Excellence

From Reactive Firefighting to Proactive Problem Prevention

Understanding the monitoring maturity model:

// Monitoring evolution: Reactive to Predictive Operations
const monitoringMaturity = {
  level1_Reactive: {
    approach: "Monitor basic uptime, react to customer complaints",
    characteristics: [
      "Ping monitoring for basic availability",
      "Manual log checking when problems occur",
      "Alerts come from angry customers on social media",
      "Debugging happens during outages",
    ],
    problems: [
      "Issues discovered hours after they occur",
      "No visibility into performance degradation",
      "Root cause analysis takes days",
      "Customer experience suffers",
    ],
    reality: "You're always one step behind problems",
  },

  level2_Proactive: {
    approach: "Monitor key metrics, alert before customer impact",
    characteristics: [
      "Application performance monitoring (APM)",
      "Infrastructure metrics and thresholds",
      "Automated alerts for critical issues",
      "Basic dashboards and visualization",
    ],
    benefits: [
      "Issues detected before customer complaints",
      "Faster mean time to resolution (MTTR)",
      "Better understanding of system behavior",
      "Reduced firefighting stress",
    ],
    limitations: "Still reactive to symptoms, not causes",
  },

  level3_Predictive: {
    approach: "Predict problems, prevent outages, optimize performance",
    characteristics: [
      "Machine learning-based anomaly detection",
      "Predictive alerting based on trends",
      "Automatic remediation for known issues",
      "Comprehensive observability platform",
    ],
    advantages: [
      "Problems prevented before they occur",
      "Automatic scaling and optimization",
      "Data-driven capacity planning",
      "Continuous performance improvement",
    ],
    outcome: "Operations become invisible to customers",
  },
};
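
The step from level 2 to level 3 is largely a shift from threshold alerts to trend alerts. As a small sketch (the Prometheus URL is an assumption), PromQL's predict_linear() can flag a filesystem that is on course to fill up hours before it actually does:

# Predictive alerting sketch: list filesystems whose 6-hour trend predicts
# exhaustion within the next 4 hours (Prometheus endpoint is an assumption).
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600) < 0' \
  | jq '.data.result[] | {instance: .metric.instance, predicted_bytes: .value[1]}'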

Comprehensive Monitoring Stack with Prometheus:

# monitoring/prometheus/prometheus.yml - Production Prometheus configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"
    environment: "production"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      path_prefix: /alertmanager
      scheme: http

# Rules for alerts and recording rules
rule_files:
  - "/etc/prometheus/rules/*.yml"
  - "/etc/prometheus/alerts/*.yml"

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
    scrape_interval: 5s
    metrics_path: /metrics

  # Node Exporter for system metrics
  - job_name: "node-exporter"
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - monitoring
    relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        action: keep
        regex: node-exporter
      - source_labels: [__meta_kubernetes_endpoint_address_target_name]
        target_label: node
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: instance

  # Kubernetes API Server
  - job_name: "kubernetes-apiservers"
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - default
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels:
          [
            __meta_kubernetes_namespace,
            __meta_kubernetes_service_name,
            __meta_kubernetes_endpoint_port_name,
          ]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes nodes (kubelet)
  - job_name: "kubernetes-nodes"
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

  # Application pods
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Database monitoring (PostgreSQL)
  - job_name: "postgres-exporter"
    static_configs:
      - targets: ["postgres-exporter:9187"]
    scrape_interval: 30s

  # Redis monitoring
  - job_name: "redis-exporter"
    static_configs:
      - targets: ["redis-exporter:9121"]
    scrape_interval: 30s

  # AWS CloudWatch metrics
  - job_name: "cloudwatch-exporter"
    static_configs:
      - targets: ["cloudwatch-exporter:9106"]
    scrape_interval: 60s

  # Application-specific metrics
  - job_name: "myapp"
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - production
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

# Remote write configuration for long-term storage
remote_write:
  - url: "https://prometheus-prod.monitoring.myapp.com/api/v1/write"
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
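
Before rolling this configuration out, validate it offline. promtool, which ships with the Prometheus distribution, can lint both the main configuration and the rule files it references (paths below mirror the layout used in this chapter):

# Lint the Prometheus configuration and the alerting/recording rules before deploying
promtool check config monitoring/prometheus/prometheus.yml
promtool check rules monitoring/prometheus/alerts/application-alerts.yml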

Advanced Alerting Rules:

# monitoring/prometheus/alerts/application-alerts.yml - Comprehensive alerting rules
groups:
  - name: application.rules
    rules:
      # ========================================
      # Application Availability Alerts
      # ========================================
      - alert: ApplicationDown
        expr: up{job="myapp"} == 0
        for: 30s
        labels:
          severity: critical
          team: platform
          service: "{{ $labels.kubernetes_name }}"
        annotations:
          summary: "Application {{ $labels.instance }} is down"
          description: |
            Application {{ $labels.kubernetes_name }} in namespace {{ $labels.kubernetes_namespace }} 
            has been down for more than 30 seconds.

            Instance: {{ $labels.instance }}
            Job: {{ $labels.job }}
          runbook_url: "https://runbooks.myapp.com/application-down"
          dashboard_url: "https://grafana.myapp.com/d/app-overview"

      - alert: ApplicationHighErrorRate
        expr: |
          (
            rate(http_requests_total{job="myapp", status=~"5.."}[5m]) /
            rate(http_requests_total{job="myapp"}[5m])
          ) > 0.05
        for: 2m
        labels:
          severity: warning
          team: platform
          service: "{{ $labels.kubernetes_name }}"
        annotations:
          summary: "High error rate detected for {{ $labels.service }}"
          description: |
            Application {{ $labels.service }} is experiencing {{ $value | humanizePercentage }} error rate
            for more than 2 minutes.

            Current error rate: {{ $value | humanizePercentage }}
            Threshold: 5%
          runbook_url: "https://runbooks.myapp.com/high-error-rate"

      - alert: ApplicationHighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(http_request_duration_seconds_bucket{job="myapp"}[5m])
          ) > 0.5
        for: 2m
        labels:
          severity: warning
          team: platform
          service: "{{ $labels.kubernetes_name }}"
        annotations:
          summary: "High latency detected for {{ $labels.service }}"
          description: |
            Application {{ $labels.service }} 95th percentile latency is {{ $value }}s
            for more than 2 minutes.

            Current P95 latency: {{ $value }}s
            Threshold: 0.5s
          runbook_url: "https://runbooks.myapp.com/high-latency"

      # ========================================
      # Infrastructure Alerts
      # ========================================
      - alert: HighCPUUsage
        expr: |
          (
            100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
          ) > 80
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: |
            CPU usage on {{ $labels.instance }} has been above 80% for more than 5 minutes.

            Current usage: {{ printf "%.1f" $value }}%
            Threshold: 80%
          runbook_url: "https://runbooks.myapp.com/high-cpu"

      - alert: HighMemoryUsage
        expr: |
          (
            1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          ) > 0.85
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: |
            Memory usage on {{ $labels.instance }} has been above 85% for more than 5 minutes.

            Current usage: {{ $value | humanizePercentage }}
            Available: {{ with printf "node_memory_MemAvailable_bytes{instance='%s'}" $labels.instance | query }}{{ . | first | value | humanize1024 }}B{{ end }}
            Total: {{ with printf "node_memory_MemTotal_bytes{instance='%s'}" $labels.instance | query }}{{ . | first | value | humanize1024 }}B{{ end }}
          runbook_url: "https://runbooks.myapp.com/high-memory"

      - alert: DiskSpaceLow
        expr: |
          (
            100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
          ) > 85
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: |
            Disk usage on {{ $labels.instance }}:{{ $labels.mountpoint }} is {{ printf "%.1f" $value }}%.

            Available: {{ with printf "node_filesystem_avail_bytes{instance='%s',mountpoint='%s'}" $labels.instance $labels.mountpoint | query }}{{ . | first | value | humanize1024 }}B{{ end }}
            Total: {{ with printf "node_filesystem_size_bytes{instance='%s',mountpoint='%s'}" $labels.instance $labels.mountpoint | query }}{{ . | first | value | humanize1024 }}B{{ end }}
          runbook_url: "https://runbooks.myapp.com/disk-space-low"

      # ========================================
      # Database Alerts
      # ========================================
      - alert: DatabaseConnectionsHigh
        expr: |
          (pg_stat_activity_count / pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High database connections on {{ $labels.instance }}"
          description: |
            Database {{ $labels.instance }} is using {{ $value | humanizePercentage }} of max connections.

            Current connections: {{ with printf "pg_stat_activity_count{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
            Max connections: {{ with printf "pg_settings_max_connections{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
          runbook_url: "https://runbooks.myapp.com/db-connections-high"

      - alert: DatabaseSlowQueries
        expr: |
          pg_stat_statements_mean_time_ms > 1000
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Slow database queries detected on {{ $labels.instance }}"
          description: |
            Database {{ $labels.instance }} has queries with average execution time of {{ $value }}ms.

            Query: {{ $labels.query }}
            Average time: {{ $value }}ms
          runbook_url: "https://runbooks.myapp.com/slow-queries"

      # ========================================
      # Kubernetes Alerts
      # ========================================
      - alert: KubernetesPodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[10m]) > 0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: |
            Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container }} 
            is restarting {{ $value }} times per second.
          runbook_url: "https://runbooks.myapp.com/pod-crash-loop"

      - alert: KubernetesNodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Kubernetes node {{ $labels.node }} is not ready"
          description: |
            Node {{ $labels.node }} has been in NotReady state for more than 5 minutes.
          runbook_url: "https://runbooks.myapp.com/node-not-ready"

      - alert: KubernetesPodPending
        expr: |
          kube_pod_status_phase{phase="Pending"} == 1
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} stuck in Pending"
          description: |
            Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in Pending state for more than 10 minutes.
            This usually indicates resource constraints or scheduling issues.
          runbook_url: "https://runbooks.myapp.com/pod-pending"

  - name: sli.rules
    interval: 30s
    rules:
      # ========================================
      # SLI (Service Level Indicator) Recording Rules
      # ========================================
      - record: sli:http_requests:rate5m
        expr: |
          sum(rate(http_requests_total[5m])) by (service, method, status)

      - record: sli:http_request_duration:p95_5m
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )

      - record: sli:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
          )

      - record: sli:availability:5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service)

      - record: sli:error_rate:5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
          sum(rate(http_requests_total[5m])) by (service)
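
Because the recording rules materialize SLIs as ordinary time series, downstream tooling such as SLO reports or deployment gates can read them with a single API call; the endpoint below is an assumption:

# Read the recorded availability SLI for one service via the Prometheus HTTP API
# (endpoint is an assumption; sli:availability:5m is defined by the recording rules above).
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=sli:availability:5m{service="myapp"}' \
  | jq -r '.data.result[] | "\(.metric.service): \(.value[1])"'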

Alertmanager Configuration for Intelligent Routing:

# monitoring/alertmanager/alertmanager.yml - Professional alert routing
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@myapp.com'
  smtp_auth_username: 'alerts@myapp.com'
  smtp_auth_password_file: '/etc/alertmanager/smtp_password'

# Templates for alert formatting
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Alert routing tree
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'

  # Routing based on severity and team
  routes:
    # Critical alerts - immediate notification
    - match:
        severity: critical
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 5m
      receiver: 'critical-alerts'
      routes:
        # Database critical issues
        - match_re:
            alertname: 'Database.*'
          receiver: 'database-critical'
        # Application down
        - match:
            alertname: 'ApplicationDown'
          receiver: 'application-critical'

    # Warning alerts - less urgent
    - match:
        severity: warning
      group_wait: 2m
      group_interval: 10m
      repeat_interval: 4h
      receiver: 'warning-alerts'
      routes:
        # Performance issues
        - match_re:
            alertname: '.*HighLatency|.*HighErrorRate'
          receiver: 'performance-team'
        # Infrastructure issues
        - match_re:
            alertname: 'High.*Usage|.*DiskSpace.*'
          receiver: 'infrastructure-team'

    # Business hours only alerts
    - match:
        severity: info
      group_wait: 5m
      group_interval: 30m
      repeat_interval: 24h
      receiver: 'info-alerts'
      active_time_intervals:
        - 'business-hours'

# Receivers define how alerts are sent
receivers:
  - name: 'default'
    slack_configs:
      - api_url_file: '/etc/alertmanager/slack_webhook'
        channel: '#alerts'
        title: 'Default Alert'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}

  - name: 'critical-alerts'
    # Multiple notification channels for critical alerts
    slack_configs:
      - api_url_file: '/etc/alertmanager/slack_webhook'
        channel: '#critical-alerts'
        title: '🚨 CRITICAL ALERT 🚨'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Service:* {{ .Labels.service }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          *Dashboard:* {{ .Annotations.dashboard_url }}
          {{ end }}
        actions:
          - type: button
            text: 'View Runbook'
            url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          - type: button
            text: 'View Dashboard'
            url: '{{ (index .Alerts 0).Annotations.dashboard_url }}'

    email_configs:
      - to: 'oncall@myapp.com'
        subject: '🚨 CRITICAL: {{ (index .Alerts 0).Annotations.summary }}'
        html: |
          <h2>Critical Alert Triggered</h2>
          {{ range .Alerts }}
          <h3>{{ .Annotations.summary }}</h3>
          <p><strong>Service:</strong> {{ .Labels.service }}</p>
          <p><strong>Description:</strong> {{ .Annotations.description }}</p>
          <p><strong>Runbook:</strong> <a href="{{ .Annotations.runbook_url }}">{{ .Annotations.runbook_url }}</a></p>
          <p><strong>Dashboard:</strong> <a href="{{ .Annotations.dashboard_url }}">{{ .Annotations.dashboard_url }}</a></p>
          {{ end }}

    # PagerDuty integration for critical alerts
    pagerduty_configs:
      - routing_key_file: '/etc/alertmanager/pagerduty_key'
        description: '{{ (index .Alerts 0).Annotations.summary }}'
        details:
          service: '{{ (index .Alerts 0).Labels.service }}'
          severity: '{{ (index .Alerts 0).Labels.severity }}'
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'

  - name: 'database-critical'
    slack_configs:
      - api_url_file: '/etc/alertmanager/slack_webhook'
        channel: '#database-alerts'
        title: '🗄️ DATABASE CRITICAL ALERT'
        text: |
          {{ range .Alerts }}
          *Database Alert:* {{ .Annotations.summary }}
          *Instance:* {{ .Labels.instance }}
          *Description:* {{ .Annotations.description }}
          {{ end }}

    email_configs:
      - to: 'dba-team@myapp.com,oncall@myapp.com'
        subject: 'DATABASE CRITICAL: {{ (index .Alerts 0).Annotations.summary }}'

  - name: 'performance-team'
    slack_configs:
      - api_url_file: '/etc/alertmanager/slack_webhook'
        channel: '#performance'
        title: '⚡ Performance Alert'
        text: |
          {{ range .Alerts }}
          *Performance Issue:* {{ .Annotations.summary }}
          *Service:* {{ .Labels.service }}
          *Current Value:* {{ .Annotations.current_value }}
          {{ end }}

  - name: 'infrastructure-team'
    slack_configs:
      - api_url_file: '/etc/alertmanager/slack_webhook'
        channel: '#infrastructure'
        title: '🏗️ Infrastructure Alert'
        text: |
          {{ range .Alerts }}
          *Infrastructure Issue:* {{ .Annotations.summary }}
          *Node:* {{ .Labels.instance }}
          *Details:* {{ .Annotations.description }}
          {{ end }}

# Inhibition rules - suppress certain alerts when others are firing
inhibit_rules:
  # Suppress all other alerts when ApplicationDown is firing
  - source_match:
      alertname: 'ApplicationDown'
    target_match_re:
      alertname: '.*HighLatency|.*HighErrorRate'
    equal: ['service']

  # Suppress node alerts when entire node is down
  - source_match:
      alertname: 'KubernetesNodeNotReady'
    target_match_re:
      alertname: 'High.*Usage'
    equal: ['instance']

# Time intervals for business hours alerting
time_intervals:
  - name: 'business-hours'
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']
        location: 'America/New_York'
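
Routing trees are easy to get subtly wrong, so exercise them offline before an incident does it for you. amtool, bundled with Alertmanager, can lint the file and show which receiver a hypothetical alert would reach:

# Lint the Alertmanager configuration, then dry-run the routing tree
amtool check-config monitoring/alertmanager/alertmanager.yml

# Which receiver handles a critical ApplicationDown alert? (labels below are illustrative)
amtool config routes test \
  --config.file=monitoring/alertmanager/alertmanager.yml \
  severity=critical alertname=ApplicationDown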

Log Aggregation and Analysis: Making Sense of System Behavior

From Log Chaos to Operational Intelligence

Understanding the log management evolution:

// Log management maturity: From chaos to intelligence
const logManagementEvolution = {
  chaosStage: {
    approach: "Logs scattered across servers, manual grep when things break",
    characteristics: [
      "SSH into servers to read log files",
      "No standardized logging format",
      "Logs rotated and lost regularly",
      "Debugging requires accessing multiple servers",
    ],
    problems: [
      "Root cause analysis takes hours or days",
      "No correlation between different services",
      "Historical data lost due to rotation",
      "Debugging distributed systems is impossible",
    ],
    reality:
      "Logs are write-only data - you collect them but can't use them effectively",
  },

  centralizedStage: {
    approach: "All logs flow to central system, searchable and retained",
    characteristics: [
      "Centralized log collection (ELK, Fluentd)",
      "Structured logging with consistent formats",
      "Search and filter capabilities",
      "Log retention and archival policies",
    ],
    benefits: [
      "Single place to search all logs",
      "Better troubleshooting capabilities",
      "Historical analysis possible",
      "Correlation across services",
    ],
    limitations: "Still reactive - logs are used after problems occur",
  },

  intelligentStage: {
    approach: "Logs become operational intelligence, proactive insights",
    characteristics: [
      "Real-time log analysis and alerting",
      "Machine learning for anomaly detection",
      "Automatic correlation and pattern recognition",
      "Integration with metrics and traces (observability)",
    ],
    advantages: [
      "Proactive problem detection from log patterns",
      "Automatic root cause analysis",
      "Predictive insights from log trends",
      "Complete system observability",
    ],
    outcome: "Logs become a proactive operational tool, not just debugging aid",
  },
};
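
The centralized stage starts with the application itself emitting structured JSON lines. The sketch below shows the shape of a single log event; the field names are assumptions chosen to match what the Logstash pipeline later in this section parses (level, service, traceId, and a nested http block):

# One structured application log line, shaped to match the Logstash filters shown later
# (field names are assumptions that mirror that pipeline, not a fixed schema).
jq -nc \
  --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  '{timestamp: $ts, level: "error", service: "myapp-api",
    traceId: "2f1c9a", spanId: "07bd",
    message: "payment provider timeout",
    http: {method: "POST", path: "/api/payments", status: 502,
           responseTime: 2310, clientIp: "203.0.113.7", userAgent: "Mozilla/5.0"}}'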

ELK Stack Implementation for Production:

# logging/elasticsearch/elasticsearch.yml - Production Elasticsearch cluster
cluster.name: "myapp-logs-production"
node.name: "${HOSTNAME}"
node.roles: [data, master, ingest]

# Network settings
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

# Discovery for cluster formation
discovery.seed_hosts:
  - "elasticsearch-0.elasticsearch-headless.logging.svc.cluster.local"
  - "elasticsearch-1.elasticsearch-headless.logging.svc.cluster.local"
  - "elasticsearch-2.elasticsearch-headless.logging.svc.cluster.local"

cluster.initial_master_nodes:
  - "elasticsearch-0"
  - "elasticsearch-1"
  - "elasticsearch-2"

# Performance settings
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb

# Security settings
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12

# Monitoring
xpack.monitoring.enabled: true
xpack.monitoring.collection.enabled: true

# Index lifecycle management
xpack.ilm.enabled: true

# Machine learning (for anomaly detection)
xpack.ml.enabled: true
xpack.ml.max_machine_memory_percent: 30

# logging/logstash/logstash.yml - Logstash node and pipeline settings
node.name: "logstash-${HOSTNAME}"
path.data: /usr/share/logstash/data
path.config: /usr/share/logstash/pipeline
path.logs: /usr/share/logstash/logs

# Pipeline settings
pipeline.workers: 4
pipeline.batch.size: 2000
pipeline.batch.delay: 50

# Queue settings for reliability
queue.type: persisted
queue.max_bytes: 8gb
queue.checkpoint.writes: 1024

# Monitoring
monitoring.enabled: true
monitoring.elasticsearch.hosts:
  - "https://elasticsearch:9200"
monitoring.elasticsearch.username: "logstash_system"
monitoring.elasticsearch.password: "${LOGSTASH_SYSTEM_PASSWORD}"

# Dead letter queue
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 2gb

# ========================================
# Pipeline Configuration
# ========================================
# logging/logstash/pipeline/main.conf - Comprehensive log processing pipeline

input {
  # ========================================
  # Application Logs via Filebeat
  # ========================================
  beats {
    port => 5044
    ssl => true
    ssl_certificate_authorities => ["/usr/share/logstash/config/certs/ca.crt"]
    ssl_certificate => "/usr/share/logstash/config/certs/logstash.crt"
    ssl_key => "/usr/share/logstash/config/certs/logstash.key"
    ssl_verify_mode => "force_peer"
  }

  # ========================================
  # Kubernetes Logs via Fluent Bit
  # ========================================
  http {
    port => 8080
    codec => "json"
    additional_codecs => {
      "application/json" => "json"
    }
  }

  # ========================================
  # Database Logs (PostgreSQL)
  # ========================================
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/lib/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://postgres:5432/logs"
    jdbc_user => "${POSTGRES_USER}"
    jdbc_password => "${POSTGRES_PASSWORD}"
    schedule => "*/5 * * * * *"
    statement => "
      SELECT log_time, user_name, database_name, process_id,
             connection_from, session_id, session_line_num, command_tag,
             session_start_time, virtual_transaction_id, transaction_id,
             error_severity, sql_state_code, message, detail, hint,
             internal_query, internal_query_pos, context, query, query_pos,
             location, application_name
      FROM postgres_log
      WHERE log_time > :sql_last_value
      ORDER BY log_time ASC"
    use_column_value => true
    tracking_column => "log_time"
    tracking_column_type => "timestamp"
  }

  # ========================================
  # AWS CloudWatch Logs
  # ========================================
  cloudwatch_logs {
    log_group => [
      "/aws/lambda/myapp-*",
      "/aws/apigateway/myapp",
      "/aws/rds/instance/myapp-production/error"
    ]
    region => "us-west-2"
    aws_credentials_file => "/usr/share/logstash/config/aws_credentials"
    interval => 60
    start_position => "end"
  }
}

filter {
  # ========================================
  # Parse and Enrich Application Logs
  # ========================================
  if [fields][log_type] == "application" {
    # Parse JSON application logs
    json {
      source => "message"
      target => "app"
    }

    # Extract timestamp
    date {
      match => [ "[app][timestamp]", "ISO8601" ]
      target => "@timestamp"
    }

    # Parse log level
    mutate {
      add_field => { "log_level" => "%{[app][level]}" }
      add_field => { "service_name" => "%{[app][service]}" }
      add_field => { "trace_id" => "%{[app][traceId]}" }
      add_field => { "span_id" => "%{[app][spanId]}" }
    }

    # Detect error patterns
    if [app][level] == "error" or [app][level] == "fatal" {
      mutate {
        add_tag => [ "error", "alert" ]
      }

      # Extract stack trace
      if [app][stack] {
        mutate {
          add_field => { "error_stack" => "%{[app][stack]}" }
        }
      }
    }

    # Parse HTTP request logs
    if [app][http] {
      mutate {
        add_field => { "http_method" => "%{[app][http][method]}" }
        add_field => { "http_status" => "%{[app][http][status]}" }
        add_field => { "http_path" => "%{[app][http][path]}" }
        add_field => { "response_time" => "%{[app][http][responseTime]}" }
        add_field => { "user_agent" => "%{[app][http][userAgent]}" }
        add_field => { "client_ip" => "%{[app][http][clientIp]}" }
      }

      # Convert response time to number
      mutate {
        convert => { "response_time" => "float" }
        convert => { "http_status" => "integer" }
      }

      # Tag slow requests
      if [response_time] and [response_time] > 2000 {
        mutate {
          add_tag => [ "slow_request" ]
        }
      }

      # Tag error responses
      if [http_status] >= 400 {
        mutate {
          add_tag => [ "http_error" ]
        }
      }
    }
  }

  # ========================================
  # Parse Nginx Access Logs
  # ========================================
  if [fields][log_type] == "nginx" {
    grok {
      match => {
        "message" => "%{NGINXACCESS}"
      }
    }

    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }

    mutate {
      convert => {
        "response" => "integer"
        "bytes" => "integer"
        "responsetime" => "float"
      }
    }

    # GeoIP enrichment
    geoip {
      source => "clientip"
      target => "geoip"
    }

    # User agent parsing
    useragent {
      source => "agent"
      target => "user_agent"
    }
  }

  # ========================================
  # Parse Database Logs
  # ========================================
  if [fields][log_type] == "database" {
    # Parse PostgreSQL logs
    grok {
      match => {
        "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{DATA:process_id}\] %{WORD:log_level}: %{GREEDYDATA:log_message}"
      }
    }

    # Extract slow query information
    if [log_message] =~ /duration: (\d+\.\d+) ms/ {
      grok {
        match => {
          "log_message" => "duration: %{NUMBER:query_duration:float} ms.*statement: %{GREEDYDATA:sql_query}"
        }
      }

      if [query_duration] and [query_duration] > 1000 {
        mutate {
          add_tag => [ "slow_query" ]
        }
      }
    }

    # Extract connection information
    if [log_message] =~ /connection/ {
      mutate {
        add_tag => [ "connection_event" ]
      }
    }
  }

  # ========================================
  # Parse Kubernetes Logs
  # ========================================
  if [kubernetes] {
    # Add Kubernetes metadata
    mutate {
      add_field => { "k8s_namespace" => "%{[kubernetes][namespace_name]}" }
      add_field => { "k8s_pod" => "%{[kubernetes][pod_name]}" }
      add_field => { "k8s_container" => "%{[kubernetes][container_name]}" }
      add_field => { "k8s_node" => "%{[kubernetes][host]}" }
    }

    # Parse container logs
    if [kubernetes][container_name] == "myapp" {
      json {
        source => "message"
        target => "app"
      }
    }
  }

  # ========================================
  # Security Event Detection
  # ========================================

  # Detect authentication failures
  if [message] =~ /authentication failed|login failed|invalid credentials/i {
    mutate {
      add_tag => [ "security", "auth_failure" ]
      add_field => { "security_event" => "authentication_failure" }
    }
  }

  # Detect SQL injection attempts
  if [message] =~ /union.*select|or.*1.*=.*1|drop.*table/i {
    mutate {
      add_tag => [ "security", "sql_injection" ]
      add_field => { "security_event" => "sql_injection_attempt" }
    }
  }

  # Detect unusual access patterns
  if [client_ip] and [http_path] and [user_agent] {
    # This would integrate with threat intelligence feeds
    # For now, detect obvious bot patterns
    if [user_agent] =~ /bot|crawler|spider|scraper/i {
      mutate {
        add_tag => [ "bot_traffic" ]
      }
    }
  }

  # ========================================
  # Performance Metrics Extraction
  # ========================================

  # Extract database performance metrics
  if "database" in [tags] and [query_duration] {
    metrics {
      meter => [ "database.queries" ]
      timer => { "database.query_duration" => "%{query_duration}" }
      increment => [ "database.slow_queries" ]
      flush_interval => 30
    }
  }

  # Extract HTTP performance metrics
  if [response_time] and [http_status] {
    metrics {
      # Sprintf-interpolated meters give per-status counters without the invalid "increment" option
      meter => [ "http.requests", "http.status.%{http_status}" ]
      timer => { "http.response_time" => "%{response_time}" }
    }
  }

  # ========================================
  # Data Cleanup and Standardization
  # ========================================

  # Remove sensitive information
  mutate {
    gsub => [
      "message", "password=[^&\s]*", "password=***",
      "message", "token=[^&\s]*", "token=***",
      "message", "api_key=[^&\s]*", "api_key=***"
    ]
  }

  # Add environment information
  mutate {
    add_field => {
      "environment" => "production"
      "log_processed_at" => "%{@timestamp}"
      "logstash_node" => "${HOSTNAME}"
    }
  }

  # Remove unnecessary fields to reduce storage
  mutate {
    remove_field => [ "[app][pid]", "[app][hostname]", "beat", "prospector" ]
  }
}

output {
  # ========================================
  # Elasticsearch for Search and Analytics
  # ========================================
  elasticsearch {
    hosts => ["elasticsearch-0:9200", "elasticsearch-1:9200", "elasticsearch-2:9200"]
    ssl => true
    ssl_certificate_verification => true
    cacert => "/usr/share/logstash/config/certs/ca.crt"
    user => "logstash_writer"
    password => "${LOGSTASH_WRITER_PASSWORD}"

    # Use index templates for better management
    index => "logs-%{service_name:unknown}-%{+YYYY.MM.dd}"
    template_name => "logs"
    template => "/usr/share/logstash/templates/logs.json"
    template_overwrite => true

    # Document routing for better performance
    routing => "%{service_name}"

    # Retry configuration
    retry_on_conflict => 3

    # Bulk flushing is governed by pipeline.batch.size / pipeline.batch.delay in logstash.yml;
    # the older bulk/flush_size options are obsolete in current versions of this output plugin.
  }

  # ========================================
  # Real-time Alerting for Critical Events
  # ========================================
  if "error" in [tags] or "security" in [tags] {
    http {
      url => "https://alertmanager.myapp.com/api/v1/alerts"
      http_method => "post"
      headers => {
        "Content-Type" => "application/json"
        "Authorization" => "Bearer ${ALERTMANAGER_TOKEN}"
      }
      content_type => "application/json"
      format => "json"
      mapping => {
        "alerts" => [
          {
            "labels" => {
              "alertname" => "LogAlert"
              "severity" => "warning"
              "service" => "%{service_name}"
              "environment" => "production"
            }
            "annotations" => {
              "summary" => "Critical log event detected"
              "description" => "%{message}"
              "log_level" => "%{log_level}"
              "timestamp" => "%{@timestamp}"
            }
            "generatorURL" => "https://kibana.myapp.com"
          }
        ]
      }
    }
  }

  # ========================================
  # Long-term Storage for Compliance
  # ========================================
  if [log_level] in ["error", "fatal"] or "security" in [tags] {
    s3 {
      access_key_id => "${AWS_ACCESS_KEY_ID}"
      secret_access_key => "${AWS_SECRET_ACCESS_KEY}"
      region => "us-west-2"
      bucket => "myapp-logs-archive"
      prefix => "year=%{+YYYY}/month=%{+MM}/day=%{+dd}/hour=%{+HH}"
      codec => "json_lines"
      time_file => 60
      size_file => 104857600  # 100MB

      # Server-side encryption
      server_side_encryption_algorithm => "AES256"

      # Lifecycle management via bucket policy
      storage_class => "STANDARD_IA"
    }
  }

  # ========================================
  # Metrics Export to Prometheus
  # ========================================
  if [response_time] {
    statsd {
      host => "prometheus-statsd-exporter"
      port => 8125
      gauge => { "http_response_time" => "%{response_time}" }
      increment => [ "http_requests_total" ]
      sample_rate => 0.1
    }
  }

  # ========================================
  # Debug Output (Development Only)
  # ========================================
  # stdout {
  #   codec => rubydebug { metadata => true }
  # }
}
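
A pipeline this size should never ship untested. Logstash can verify the configuration without starting it, and the node stats API exposes per-plugin throughput once it is running; the path below is an assumption about your layout, and 9600 is the default API port:

# Syntax-check the pipeline configuration without starting it
# (-t is shorthand for --config.test_and_exit; the path mirrors the layout above)
logstash -t -f logging/logstash/pipeline/main.conf

# Once running, the monitoring API shows per-filter event counts, which is the
# fastest way to spot a filter that silently drops or mangles events
curl -s http://localhost:9600/_node/stats/pipelines | jq '.pipelines.main.plugins.filters'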

Disaster Recovery and Backups: Business Continuity Excellence

From Hope-Based Recovery to Tested Resilience

Understanding disaster recovery maturity:

// Disaster recovery evolution: From hope to certainty
const disasterRecoveryMaturity = {
  hopeBasedRecovery: {
    approach: "Assume nothing bad will happen, deal with disasters reactively",
    characteristics: [
      "No formal backup strategy",
      "Occasional manual backups",
      "Recovery procedures untested",
      "Single points of failure everywhere",
    ],
    problems: [
      "Data loss when disasters occur",
      "Extended downtime during recovery",
      "Recovery procedures fail when needed",
      "Business operations halt completely",
    ],
    reality: "Hope is not a strategy - disasters will happen",
  },

  basicBackupStrategy: {
    approach: "Regular backups with basic recovery procedures",
    characteristics: [
      "Automated backup schedules",
      "Multiple backup retention periods",
      "Basic recovery documentation",
      "Some redundancy in critical systems",
    ],
    benefits: [
      "Data protection against common failures",
      "Faster recovery than no backup strategy",
      "Some confidence in business continuity",
    ],
    limitations:
      "Recovery time still significant, procedures may fail under pressure",
  },

  comprehensiveDisasterRecovery: {
    approach:
      "Tested, automated disaster recovery with business continuity planning",
    characteristics: [
      "Multi-tier backup and recovery strategy",
      "Automated failover and recovery systems",
      "Regular disaster recovery testing",
      "Cross-region redundancy and replication",
    ],
    advantages: [
      "Minimal data loss (RPO < 5 minutes)",
      "Minimal downtime (RTO < 30 minutes)",
      "Confidence through regular testing",
      "Business operations continue seamlessly",
    ],
    outcome: "Disasters become minor inconveniences, not existential threats",
  },
};
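
One way to keep the "RPO < 5 minutes" claim honest is to measure it: the age of the newest archived WAL segment is an upper bound on how much committed data a database failure could lose right now. The bucket and prefix below are assumptions matching the backup script that follows:

# Rough RPO check: how old is the newest archived WAL segment?
# Bucket/prefix are assumptions matching the backup script below; requires GNU date.
newest=$(aws s3 ls "s3://myapp-backups-production/wal_archive/" --recursive \
    | sort -k1,2 | tail -n1 | awk '{print $1" "$2}')
echo "Newest WAL archive written at: $newest"
echo "Approximate RPO right now: $(( ($(date +%s) - $(date -d "$newest" +%s)) / 60 )) minutes"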

Comprehensive Backup Strategy Implementation:

#!/bin/bash
# backup-strategy.sh - Professional multi-tier backup system

set -euo pipefail

# Configuration
PROJECT_NAME="${PROJECT_NAME:-myapp}"
ENVIRONMENT="${ENVIRONMENT:-production}"
AWS_REGION="${AWS_REGION:-us-west-2}"
BACKUP_BUCKET="${PROJECT_NAME}-backups-${ENVIRONMENT}"
RETENTION_DAYS_DAILY="${RETENTION_DAYS_DAILY:-30}"
RETENTION_DAYS_WEEKLY="${RETENTION_DAYS_WEEKLY:-90}"
RETENTION_DAYS_MONTHLY="${RETENTION_DAYS_MONTHLY:-365}"

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

error() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] ERROR: $1" >&2
    exit 1
}

# Remove the temporary working directory used by the verify/restore routines below
cleanup_temp_files() {
    rm -rf "$1"
}

# ========================================
# Database Backup Strategy
# ========================================

backup_postgresql() {
    local db_host="$1"
    local db_name="$2"
    local backup_type="${3:-full}"  # full, incremental

    log "Starting PostgreSQL backup: $backup_type for $db_name"

    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_dir="/tmp/backups/postgres"
    local backup_file="$backup_dir/${db_name}_${backup_type}_${timestamp}"

    mkdir -p "$backup_dir"

    case "$backup_type" in
        "full")
            # Full database dump with compression
            PGPASSWORD="$DB_PASSWORD" pg_dump \
                --host="$db_host" \
                --username="$DB_USER" \
                --dbname="$db_name" \
                --format=custom \
                --compress=9 \
                --verbose \
                --file="${backup_file}.dump"

            # Binary backup using pg_basebackup for faster recovery
            PGPASSWORD="$DB_PASSWORD" pg_basebackup \
                --host="$db_host" \
                --username="$DB_USER" \
                --format=tar \
                --gzip \
                --compress=9 \
                --progress \
                --verbose \
                --wal-method=stream \
                --pgdata="${backup_file}_basebackup"

            # Tar the basebackup directory
            tar -czf "${backup_file}_basebackup.tar.gz" -C "${backup_file}_basebackup" .
            rm -rf "${backup_file}_basebackup"
            ;;

        "incremental")
            # WAL archive backup for point-in-time recovery
            local wal_backup_dir="$backup_dir/wal_archive"
            mkdir -p "$wal_backup_dir"

            # Sync WAL files from archive location
            aws s3 sync "s3://$BACKUP_BUCKET/wal_archive/" "$wal_backup_dir/" \
                --region "$AWS_REGION"

            # Create incremental backup metadata
            cat > "${backup_file}_incremental.json" << EOF
{
  "backup_type": "incremental",
  "timestamp": "$timestamp",
  "base_backup_lsn": "$(PGPASSWORD="$DB_PASSWORD" psql -h "$db_host" -U "$DB_USER" -d "$db_name" -t -c "SELECT pg_current_wal_lsn();" | tr -d ' ')",
  "wal_files_count": $(find "$wal_backup_dir" -name "*.wal" | wc -l),
  "total_size": "$(du -sh "$wal_backup_dir" | cut -f1)"
}
EOF
            ;;
    esac

    # Calculate checksums for integrity verification
    find "$backup_dir" -name "*${timestamp}*" -type f -exec sha256sum {} \; > "${backup_file}_checksums.txt"

    # Upload to S3 with server-side encryption
    aws s3 sync "$backup_dir/" "s3://$BACKUP_BUCKET/postgres/" \
        --region "$AWS_REGION" \
        --storage-class STANDARD_IA \
        --server-side-encryption AES256 \
        --metadata backup_type="$backup_type",timestamp="$timestamp",environment="$ENVIRONMENT"

    # Verify upload integrity
    aws s3api head-object \
        --bucket "$BACKUP_BUCKET" \
        --key "postgres/${db_name}_${backup_type}_${timestamp}.dump" \
        --region "$AWS_REGION" || error "Backup upload verification failed"

    # Clean up local files
    rm -rf "$backup_dir"/*${timestamp}*

    log "PostgreSQL backup completed successfully"
}

# ========================================
# Kubernetes Resources Backup
# ========================================

backup_kubernetes_resources() {
    local namespace="$1"

    log "Starting Kubernetes resources backup for namespace: $namespace"

    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_dir="/tmp/backups/kubernetes"
    local backup_file="$backup_dir/k8s_${namespace}_${timestamp}"

    mkdir -p "$backup_dir"

    # Backup all resources in namespace
    kubectl get all,configmaps,secrets,persistentvolumeclaims,ingresses \
        --namespace="$namespace" \
        --output=yaml > "${backup_file}_resources.yaml"

    # Backup persistent volumes data using Velero if available
    if command -v velero &> /dev/null; then
        log "Creating Velero backup for namespace: $namespace"
        velero backup create "backup-${namespace}-${timestamp}" \
            --include-namespaces="$namespace" \
            --storage-location=default \
            --volume-snapshot-locations=default \
            --ttl=720h
    fi

    # Backup cluster-level resources
    kubectl get nodes,persistentvolumes,storageclasses,clusterroles,clusterrolebindings \
        --output=yaml > "${backup_file}_cluster_resources.yaml"

    # Backup Helm releases
    if command -v helm &> /dev/null; then
        helm list --namespace="$namespace" --output=json > "${backup_file}_helm_releases.json"

        # Export each Helm release values
        helm list --namespace="$namespace" --short | while read -r release; do
            if [ -n "$release" ]; then
                helm get values "$release" --namespace="$namespace" > "${backup_file}_helm_${release}_values.yaml"
            fi
        done
    fi

    # Create backup metadata
    cat > "${backup_file}_metadata.json" << EOF
{
  "backup_type": "kubernetes_resources",
  "namespace": "$namespace",
  "timestamp": "$timestamp",
  "cluster_version": "$(kubectl version --short --client=false | grep Server | cut -d' ' -f3)",
  "node_count": $(kubectl get nodes --no-headers | wc -l),
  "pod_count": $(kubectl get pods --namespace="$namespace" --no-headers | wc -l)
}
EOF

    # Compress backup files
    tar -czf "${backup_file}.tar.gz" -C "$backup_dir" $(basename "${backup_file}")*
    rm -f "${backup_file}"*

    # Upload to S3
    aws s3 cp "${backup_file}.tar.gz" "s3://$BACKUP_BUCKET/kubernetes/" \
        --region "$AWS_REGION" \
        --storage-class STANDARD_IA \
        --server-side-encryption AES256

    # Clean up local files
    rm -f "${backup_file}.tar.gz"

    log "Kubernetes backup completed successfully"
}

# ========================================
# Application Data Backup
# ========================================

backup_application_data() {
    local data_path="$1"
    local backup_name="$2"

    log "Starting application data backup: $backup_name"

    local timestamp=$(date +%Y%m%d_%H%M%S)
    local backup_dir="/tmp/backups/application"
    local backup_file="$backup_dir/${backup_name}_${timestamp}"

    mkdir -p "$backup_dir"

    # Create compressed archive with progress
    tar -czf "${backup_file}.tar.gz" \
        --directory="$(dirname "$data_path")" \
        --verbose \
        --exclude='*.tmp' \
        --exclude='*.log' \
        --exclude='cache/*' \
        "$(basename "$data_path")"

    # Generate integrity checksum
    sha256sum "${backup_file}.tar.gz" > "${backup_file}.sha256"

    # Create backup manifest
    cat > "${backup_file}_manifest.json" << EOF
{
  "backup_name": "$backup_name",
  "source_path": "$data_path",
  "timestamp": "$timestamp",
  "size_bytes": $(stat -f%z "${backup_file}.tar.gz" 2>/dev/null || stat -c%s "${backup_file}.tar.gz"),
  "file_count": $(tar -tzf "${backup_file}.tar.gz" | wc -l),
  "checksum": "$(cut -d' ' -f1 < "${backup_file}.sha256")"
}
EOF

    # Upload to S3 with lifecycle transition
    aws s3 cp "${backup_file}.tar.gz" "s3://$BACKUP_BUCKET/application/" \
        --region "$AWS_REGION" \
        --storage-class STANDARD_IA \
        --server-side-encryption AES256 \
        --metadata backup_name="$backup_name",timestamp="$timestamp"

    aws s3 cp "${backup_file}.sha256" "s3://$BACKUP_BUCKET/application/" \
        --region "$AWS_REGION" \
        --storage-class STANDARD_IA \
        --server-side-encryption AES256

    aws s3 cp "${backup_file}_manifest.json" "s3://$BACKUP_BUCKET/application/" \
        --region "$AWS_REGION" \
        --storage-class STANDARD_IA \
        --server-side-encryption AES256

    # Clean up local files
    rm -f "${backup_file}"*

    log "Application data backup completed successfully"
}

# ========================================
# Backup Verification and Testing
# ========================================

verify_backup_integrity() {
    local backup_type="$1"
    local backup_identifier="$2"

    log "Verifying backup integrity: $backup_type/$backup_identifier"

    case "$backup_type" in
        "postgres")
            # Download and verify database backup
            local temp_dir="/tmp/verify_backup"
            mkdir -p "$temp_dir"

            aws s3 cp "s3://$BACKUP_BUCKET/postgres/${backup_identifier}.dump" \
                "$temp_dir/" --region "$AWS_REGION"

            # Verify dump file integrity
            if ! pg_restore --list "$temp_dir/${backup_identifier}.dump" &>/dev/null; then
                error "Database backup integrity check failed"
            fi
            ;;

        "kubernetes")
            # Verify Kubernetes backup YAML validity
            local temp_dir="/tmp/verify_backup"
            mkdir -p "$temp_dir"

            aws s3 cp "s3://$BACKUP_BUCKET/kubernetes/${backup_identifier}.tar.gz" \
                "$temp_dir/" --region "$AWS_REGION"

            tar -xzf "$temp_dir/${backup_identifier}.tar.gz" -C "$temp_dir"

            # Validate the exported manifests without touching the cluster
            # (Helm values files are skipped; they are not Kubernetes manifests)
            find "$temp_dir" -name "*_resources.yaml" | while read -r manifest; do
                kubectl apply --dry-run=client -f "$manifest" > /dev/null || \
                    error "Kubernetes backup validation failed: $manifest"
            done
            ;;
    esac

    log "Backup integrity verification passed"
    cleanup_temp_files "$temp_dir"
}

# ========================================
# Backup Restoration Procedures
# ========================================

restore_database() {
    local backup_identifier="$1"
    local target_db_host="$2"
    local target_db_name="$3"

    log "Starting database restoration: $backup_identifier"

    # Download backup from S3
    local restore_dir="/tmp/restore"
    mkdir -p "$restore_dir"

    aws s3 cp "s3://$BACKUP_BUCKET/postgres/${backup_identifier}.dump" \
        "$restore_dir/" --region "$AWS_REGION"

    # Verify checksum if available
    if aws s3 ls "s3://$BACKUP_BUCKET/postgres/${backup_identifier}_checksums.txt" --region "$AWS_REGION" &>/dev/null; then
        aws s3 cp "s3://$BACKUP_BUCKET/postgres/${backup_identifier}_checksums.txt" \
            "$restore_dir/" --region "$AWS_REGION"

        cd "$restore_dir"
        if ! sha256sum --check "${backup_identifier}_checksums.txt"; then
            error "Backup file integrity check failed"
        fi
        cd - > /dev/null
    fi

    # Create restoration database if needed
    PGPASSWORD="$DB_PASSWORD" psql \
        --host="$target_db_host" \
        --username="$DB_USER" \
        --command="CREATE DATABASE ${target_db_name}_restore;" || true

    # Restore database
    PGPASSWORD="$DB_PASSWORD" pg_restore \
        --host="$target_db_host" \
        --username="$DB_USER" \
        --dbname="${target_db_name}_restore" \
        --verbose \
        --clean \
        --if-exists \
        --no-owner \
        --no-privileges \
        "$restore_dir/${backup_identifier}.dump"

    # Verify restoration
    local restored_tables=$(PGPASSWORD="$DB_PASSWORD" psql \
        --host="$target_db_host" \
        --username="$DB_USER" \
        --dbname="${target_db_name}_restore" \
        --tuples-only \
        --command="SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';")

    if [ "$restored_tables" -eq 0 ]; then
        error "Database restoration failed - no tables found"
    fi

    log "Database restoration completed successfully: $restored_tables tables restored"
    cleanup_temp_files "$restore_dir"
}

# ========================================
# Backup Lifecycle Management
# ========================================

cleanup_old_backups() {
    log "Starting backup lifecycle management"

    local current_date=$(date +%s)
    local daily_cutoff=$((current_date - (RETENTION_DAYS_DAILY * 86400)))
    local weekly_cutoff=$((current_date - (RETENTION_DAYS_WEEKLY * 86400)))
    local monthly_cutoff=$((current_date - (RETENTION_DAYS_MONTHLY * 86400)))

    # Get all backup objects
    aws s3api list-objects-v2 \
        --bucket "$BACKUP_BUCKET" \
        --region "$AWS_REGION" \
        --query 'Contents[?StorageClass!=`GLACIER`].[Key,LastModified]' \
        --output text | while read -r key last_modified; do

        local modified_timestamp=$(date -d "$last_modified" +%s)
        local age_days=$(( (current_date - modified_timestamp) / 86400 ))

        # Apply retention policies
        if [ $modified_timestamp -lt $monthly_cutoff ]; then
            log "Deleting old backup: $key (age: $age_days days)"
            aws s3 rm "s3://$BACKUP_BUCKET/$key" --region "$AWS_REGION"
        elif [ $modified_timestamp -lt $weekly_cutoff ] && [[ ! "$key" =~ weekly ]]; then
            # Transition to cheaper storage
            aws s3api copy-object \
                --bucket "$BACKUP_BUCKET" \
                --copy-source "$BACKUP_BUCKET/$key" \
                --key "$key" \
                --storage-class GLACIER \
                --metadata-directive COPY \
                --region "$AWS_REGION"
        fi
    done

    log "Backup lifecycle management completed"
}

# ========================================
# Disaster Recovery Testing
# ========================================

disaster_recovery_test() {
    local test_type="$1"  # full, database, application

    log "Starting disaster recovery test: $test_type"

    case "$test_type" in
        "full")
            # Test complete environment restoration
            test_database_recovery
            test_kubernetes_recovery
            test_application_data_recovery
            ;;
        "database")
            test_database_recovery
            ;;
        "application")
            test_application_data_recovery
            ;;
    esac

    log "Disaster recovery test completed successfully"
}

test_database_recovery() {
    log "Testing database recovery procedures"

    # Find latest backup
    local latest_backup=$(aws s3 ls "s3://$BACKUP_BUCKET/postgres/" --region "$AWS_REGION" | \
        awk '{print $4}' | grep '_full_.*\.dump$' | sort | tail -n1 | sed 's/\.dump$//')

    if [ -z "$latest_backup" ]; then
        error "No database backups found for testing"
    fi

    # Restore to a throwaway database; capture the name so it can be cleaned up afterwards
    local test_db="test_restore_$(date +%s)"
    restore_database "$latest_backup" "$DB_HOST" "$test_db"
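
    # Drop the throwaway database created by restore_database (it appends "_restore"
    # to the target name) so repeated DR tests don't leave test databases behind
    PGPASSWORD="$DB_PASSWORD" psql \
        --host="$DB_HOST" \
        --username="$DB_USER" \
        --command="DROP DATABASE IF EXISTS ${test_db}_restore;" || true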

    log "Database recovery test passed"
}

# Main command dispatcher
case "${1:-help}" in
    "postgres")
        backup_postgresql "${2:-$DB_HOST}" "${3:-$DB_NAME}" "${4:-full}"
        ;;
    "kubernetes")
        backup_kubernetes_resources "${2:-production}"
        ;;
    "application")
        backup_application_data "${2:-/app/data}" "${3:-appdata}"
        ;;
    "verify")
        verify_backup_integrity "$2" "$3"
        ;;
    "restore-db")
        restore_database "$2" "${3:-$DB_HOST}" "${4:-$DB_NAME}"
        ;;
    "cleanup")
        cleanup_old_backups
        ;;
    "test-dr")
        disaster_recovery_test "${2:-database}"
        ;;
    "full-backup")
        backup_postgresql "$DB_HOST" "$DB_NAME" "full"
        backup_kubernetes_resources "production"
        backup_application_data "/app/data" "appdata"
        ;;
    "help"|*)
        cat << EOF
Professional Backup and Disaster Recovery System

Usage: $0 <command> [options]

Commands:
    postgres <host> <dbname> [type]     Backup PostgreSQL database
    kubernetes <namespace>              Backup Kubernetes resources
    application <path> <name>           Backup application data
    verify <type> <identifier>          Verify backup integrity
    restore-db <backup> [host] [db]     Restore database from backup
    cleanup                             Clean up old backups per retention policy
    test-dr [type]                      Run disaster recovery test
    full-backup                         Run complete backup suite

Examples:
    $0 postgres mydb-host myapp full
    $0 kubernetes production
    $0 application /app/data userdata
    $0 verify postgres myapp_full_20231201_120000
    $0 test-dr full
EOF
        ;;
esac
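
In practice, none of these commands should depend on someone remembering to run them. A minimal scheduling sketch, assuming the script is installed as /usr/local/bin/backup.sh and runs as a dedicated backup user with the required database and AWS credentials (the path, user, times, and log location below are illustrative, not defined by the script itself):

# /etc/cron.d/backup-schedule -- illustrative schedule for the backup suite
# Nightly full backup, daily retention cleanup, weekly database DR test
30 1 * * *   backup   /usr/local/bin/backup.sh full-backup      >> /var/log/backup.log 2>&1
0  4 * * *   backup   /usr/local/bin/backup.sh cleanup          >> /var/log/backup.log 2>&1
0  6 * * 0   backup   /usr/local/bin/backup.sh test-dr database >> /var/log/backup.log 2>&1

The same three entry points map just as cleanly onto a Kubernetes CronJob or a CI scheduler in a containerized environment; what matters is that backups, retention, and recovery tests run on a fixed cadence rather than on memory.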

Conclusion: From Infrastructure Amateur to Operations Excellence

You’ve now mastered the complete deployment and infrastructure ecosystem that separates professional operations from amateur setups that crumble under real-world pressure.

What you’ve accomplished:

  • CI/CD Pipeline Mastery: Automated deployment pipelines with comprehensive testing, security scanning, and intelligent deployment strategies that eliminate manual errors and enable confident, frequent releases
  • Advanced Infrastructure as Code: Modular, version-controlled infrastructure with shared state management, environment consistency, and collaborative review workflows that make infrastructure changes predictable and reviewable
  • Proactive Monitoring Excellence: Comprehensive observability with intelligent alerting, anomaly detection, and operational insights that prevent problems instead of reacting to them
  • Intelligent Log Management: Centralized log aggregation with real-time analysis, security event detection, and operational intelligence that transforms logs from debugging aids into proactive operational tools
  • Battle-tested Disaster Recovery: Multi-tier backup strategies with automated recovery procedures, regular testing, and business continuity planning that makes disasters minor inconveniences instead of existential threats

The professional operations transformation you’ve achieved:

// Your operations evolution: From amateur to excellence
const operationalTransformation = {
  before: {
    deployments: "Manual SSH, pray nothing breaks, debug in production",
    monitoring: "Customers tell us when things break via angry emails",
    infrastructure:
      "Snowflake servers, configuration drift, single points of failure",
    logging: "SSH and grep through scattered log files when debugging",
    disasterRecovery: "Hope nothing bad happens, panic when it does",
    teamProductivity: "80% time firefighting, 20% building features",
    customerExperience:
      "Unpredictable outages, slow performance, data loss risk",
  },

  after: {
    deployments:
      "Automated CI/CD with zero-downtime, automatic rollback, comprehensive testing",
    monitoring:
      "Proactive alerts before customer impact, predictive problem prevention",
    infrastructure:
      "Infrastructure as code, consistent environments, auto-scaling resilience",
    logging:
      "Centralized intelligence with real-time analysis and security detection",
    disasterRecovery:
      "Tested procedures with automated failover and minimal downtime",
    teamProductivity:
      "5% time on operations, 95% time on innovation and features",
    customerExperience: "Reliable service, optimal performance, zero data loss",
  },

  businessImpact: [
    "Deploy 10x more frequently with higher reliability",
    "Mean time to resolution reduced from hours to minutes",
    "Infrastructure costs optimized through automation and monitoring",
    "Development velocity increased through operational excellence",
    "Customer satisfaction improved through reliable service delivery",
    "Competitive advantage through faster innovation cycles",
  ],
};

You now operate infrastructure that scales confidently, deploys reliably, monitors proactively, and recovers gracefully. Your systems enable teams to focus on building value instead of fighting operational fires.

But here’s the production reality that creates the biggest impact: these operational practices aren’t just about technology—they’re about enabling business success. While your competitors struggle with manual deployments, reactive monitoring, and disaster recovery panic, your infrastructure becomes an invisible foundation that just works, allowing your team to outpace the market with rapid innovation and reliable service delivery.

Your operations are no longer a liability that might break—they’re a competitive advantage that enables greatness. The next time someone asks if you’re ready for production scale, you won’t just say yes—you’ll demonstrate it with the confidence that comes from professional operational excellence.

Welcome to the ranks of engineers who build systems that businesses can depend on. Your infrastructure is now ready for whatever scale and challenges come next.