Scalability & High Availability - 2/2

The $2.4 Billion Amazon Prime Day That Almost Wasn’t

Picture this nightmare scenario: It’s July 16th, 2018. Amazon’s biggest shopping event of the year is about to launch. Millions of customers are hitting refresh, ready to spend billions on deals. The marketing team has spent months building hype. Third-party sellers have invested their life savings in inventory.

And then, at exactly 12:00 PM PT, everything goes sideways.

For the first hour and a half of Prime Day, Amazon’s homepage displayed nothing but cute dog photos with error messages. Not exactly the billion-dollar shopping experience customers expected.

The cascade of failures was absolutely brutal:

  • Complete website breakdown: The main Amazon.com site became completely inaccessible for over 90 minutes
  • Mobile app crashes: iOS and Android apps crashed instantly under the unexpected traffic spike
  • Third-party seller panic: Marketplace vendors watched their biggest sales day disappear in real-time
  • International domino effect: The failure spread to Amazon’s European sites, compounding the disaster
  • Customer service meltdown: Support systems couldn’t handle the flood of angry customers
  • Inventory management chaos: Warehouse systems couldn’t process the backlog of orders once service resumed

The financial damage was staggering:

  • $100 million in lost sales during the 90-minute outage
  • $2.4 billion less revenue than projected for the entire event
  • Stock price dropped 1.2% in after-hours trading (that’s $12 billion in market cap)
  • Competitor gains: Target, eBay, and Walmart all reported massive traffic spikes as customers fled
  • Brand reputation hit: #AmazonDown trending worldwide with millions of mocking memes
  • 6 months of engineering effort to prevent similar failures

Here’s the kicker: Amazon has some of the world’s best engineers, unlimited resources, and years of experience handling massive traffic spikes. Yet a single point of failure in their traffic routing system brought down the entire operation.

The lesson? High availability isn’t about having redundant servers—it’s about having redundant everything, with intelligent failover strategies that actually work when the world is watching.

The Uncomfortable Truth About High Availability

Here’s what separates systems that stay online during disasters from those that become cautionary tales: True high availability isn’t about preventing failures—it’s about designing systems where failures are invisible to users because everything fails gracefully and recovers automatically.

Most developers approach high availability like this:

  1. Set up a backup server and call it “redundancy”
  2. Assume their cloud provider handles everything else
  3. Test disaster recovery once a year during a planned maintenance window
  4. Discover during actual outages that their “highly available” system has seventeen single points of failure
  5. Spend the next six months explaining to executives why “99.9% uptime” actually means 8+ hours of downtime per year

But systems that achieve true enterprise-level availability work differently:

  1. Assume everything will fail and design for graceful degradation from day one
  2. Implement intelligent failover that works faster than users can notice
  3. Test disaster scenarios continuously using chaos engineering principles
  4. Build across multiple regions so natural disasters can’t take down their entire operation
  5. Monitor everything obsessively with automated recovery that fixes problems before humans even notice

The difference isn’t just uptime percentages—it’s the difference between systems that fail visibly and embarrassingly and systems that fail invisibly and recover gracefully.
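The “99.9% uptime” figure in the first list is easy to check with a little arithmetic. The sketch below is only an illustration: the function name allowedDowntimeHoursPerYear is hypothetical, and a 365-day year is assumed.

```typescript
// Convert an availability target into the downtime it still permits.
// Assumption: a 365-day year (leap years are ignored for simplicity).
const HOURS_PER_YEAR = 365 * 24; // 8760

function allowedDowntimeHoursPerYear(availabilityPercent: number): number {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError("availability must be between 0 and 100");
  }
  return HOURS_PER_YEAR * (1 - availabilityPercent / 100);
}

// Print the yearly downtime budget for a few common targets
for (const target of [99, 99.9, 99.99, 99.999]) {
  const hours = allowedDowntimeHoursPerYear(target);
  console.log(`${target}% uptime allows about ${hours.toFixed(2)} hours of downtime per year`);
}
```

Under those assumptions, 99.9% leaves roughly 8.76 hours of downtime per year, and each additional nine shrinks that budget by a factor of ten.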

Ready to build applications that stay online even when data centers catch fire? Let’s dive into the high availability patterns that power the world’s most reliable systems.


High Availability Patterns: Building Bulletproof Systems

The Problem: Single Points of Failure Everywhere

// The "highly available" system that isn't
class FragileWebService {
  private database: Database;
  private cache: Cache;
  private externalAPI: ExternalAPI;

  constructor() {
    // Single database connection - RED FLAG #1
    this.database = new Database({
      host: "db-primary.company.com", // What if this host dies?
      connectionString: "postgresql://user:pass@db-primary:5432/app",
    });

    // Single cache instance - RED FLAG #2
    this.cache = new Cache({
      host: "cache-server.company.com", // Another single point of failure
      port: 6379,
    });

    // External API without fallback - RED FLAG #3
    this.externalAPI = new ExternalAPI({
      baseUrl: "https://payments-api.thirdparty.com",
      timeout: 30000,
    });
  }

  async getUser(userId: string): Promise<User> {
    try {
      // Try cache first
      const cached = await this.cache.get(`user_${userId}`);
      if (cached) {
        return cached;
      }

      // Fall back to database - RED FLAG #4
      // If database is slow, entire request blocks
      const user = await this.database.query(
        "SELECT * FROM users WHERE id = ?",
        [userId]
      );

      if (!user) {
        throw new Error("User not found");
      }

      // Cache for next time
      await this.cache.set(`user_${userId}`, user, 3600);

      return user;
    } catch (error) {
      // No graceful degradation - RED FLAG #5
      // When cache or database fails, everything fails
      throw new ServiceUnavailableError("User service unavailable");
    }
  }

  async processPayment(paymentData: PaymentRequest): Promise<PaymentResult> {
    try {
      // Synchronous external API call - RED FLAG #6
      // If payments API is down, our entire checkout flow breaks
      const result = await this.externalAPI.post("/payments", paymentData);

      // Store transaction in database
      await this.database.query(
        "INSERT INTO transactions (user_id, amount, status) VALUES (?, ?, ?)",
        [paymentData.userId, paymentData.amount, result.status]
      );

      return result;
    } catch (error) {
      // No retry logic or fallback - RED FLAG #7
      // Temporary network issues cause permanent failures
      throw new PaymentFailedError("Payment processing failed");
    }
  }

  // What happens during failures:
  // 1. Database goes down → Entire application becomes unusable
  // 2. Cache goes down → Database gets overwhelmed
  // 3. External API is slow → All user requests timeout
  // 4. Network hiccup → Payments fail permanently
  // 5. Single server restart → Complete service interruption
}
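Failure mode 3 in the comment above (a slow external API stalling every request) is often the easiest one to contain: give each call an explicit time budget instead of relying on a 30-second client timeout. The following is a minimal sketch, not the implementation used later in this article. The names withTimeout and TimeoutError are hypothetical, and the 2-second budget in the usage comment is an arbitrary example value.

```typescript
// Hypothetical error type raised when the time budget is exceeded
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Operation timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Settle with the result of `work`, or reject with TimeoutError after `ms`.
// Note: this only stops *waiting*; it does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer); // work finished first, so drop the pending timer
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

// Usage sketch (paymentsClient is a hypothetical object):
//   const result = await withTimeout(paymentsClient.post("/payments", data), 2000);
```

A caller can then catch the timeout error and decide whether to retry, fall back, or return a degraded response, rather than letting one slow dependency hold every request open.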

The Solution: High Availability Architecture with Intelligent Failover

// uuidv4() is assumed to come from the "uuid" npm package
import { v4 as uuidv4 } from "uuid";

// Enterprise-grade high availability architecture
export class HighlyAvailableService {
  constructor(
    private databaseCluster: DatabaseCluster,
    private cacheCluster: CacheCluster,
    private loadBalancer: LoadBalancer,
    private circuitBreaker: CircuitBreakerRegistry,
    private healthChecker: HealthChecker,
    private fallbackStrategies: FallbackStrategyRegistry,
    private metrics: MetricsCollector
  ) {}

  async getUser(userId: string): Promise<User> {
    const operationTimer = this.metrics.startTimer("get_user_operation");

    try {
      return await this.executeWithFallback("getUser", async () => {
        // Try distributed cache cluster first
        const cached = await this.cacheCluster.get(`user_${userId}`);
        if (cached !== null) {
          this.metrics.incrementCounter("cache_hit", { operation: "get_user" });
          return cached;
        }

        this.metrics.incrementCounter("cache_miss", { operation: "get_user" });

        // Try read replica first (faster, doesn't block writes)
        const user = await this.databaseCluster.executeOnReadReplica(
          "SELECT * FROM users WHERE id = ?",
          [userId]
        );

        if (user) {
          // Asynchronously update cache (don't block response)
          this.updateCacheAsync(`user_${userId}`, user, 3600);
          return user;
        }

        throw new UserNotFoundError(`User ${userId} not found`);
      });
    } finally {
      operationTimer.end();
    }
  }

  private async executeWithFallback<T>(
    operation: string,
    primaryOperation: () => Promise<T>
  ): Promise<T> {
    const fallbackStrategy = this.fallbackStrategies.get(operation);

    try {
      return await primaryOperation();
    } catch (error) {
      this.metrics.incrementCounter("primary_operation_failed", {
        operation,
        error: error instanceof Error ? error.constructor.name : "unknown",
      });

      if (fallbackStrategy && fallbackStrategy.shouldFallback(error)) {
        try {
          const result = await fallbackStrategy.execute(error);
          this.metrics.incrementCounter("fallback_success", { operation });
          return result;
        } catch (fallbackError) {
          this.metrics.incrementCounter("fallback_failed", { operation });
          throw fallbackError;
        }
      }

      throw error;
    }
  }

  async processPayment(paymentData: PaymentRequest): Promise<PaymentResult> {
    const operationId = uuidv4();
    const operationTimer = this.metrics.startTimer("payment_processing");

    try {
      return await this.processPaymentWithResilience(paymentData, operationId);
    } finally {
      operationTimer.end();
    }
  }

  private async processPaymentWithResilience(
    paymentData: PaymentRequest,
    operationId: string
  ): Promise<PaymentResult> {
    const paymentCircuitBreaker = this.circuitBreaker.get("payment_service");

    // Check if external payment service is available
    if (!paymentCircuitBreaker.canExecute()) {
      // External service is down - use fallback payment processor
      return await this.processFallbackPayment(paymentData, operationId);
    }

    try {
      // Execute with circuit breaker protection
      const paymentResult = await paymentCircuitBreaker.execute(async () => {
        return await this.executePaymentWithRetry(paymentData, operationId);
      });

      // Store transaction record asynchronously (don't block payment response)
      this.storeTransactionAsync(paymentData, paymentResult, operationId);

      return paymentResult;
    } catch (error) {
      // Circuit breaker opened or payment failed
      if (error instanceof CircuitBreakerOpenError) {
        return await this.processFallbackPayment(paymentData, operationId);
      }

      throw error;
    }
  }

  private async executePaymentWithRetry(
    paymentData: PaymentRequest,
    operationId: string
  ): Promise<PaymentResult> {
    const retryConfig = {
      maxAttempts: 3,
      baseDelay: 1000,
      maxDelay: 10000,
      exponentialBase: 2,
    };

    for (let attempt = 1; attempt <= retryConfig.maxAttempts; attempt++) {
      try {
        const externalPaymentService =
          this.loadBalancer.selectHealthyNode("payment_service");

        const result = await externalPaymentService.processPayment({
          ...paymentData,
          operationId,
          attempt,
        });

        this.metrics.incrementCounter("payment_success", {
          attempt,
          service: externalPaymentService.nodeId,
        });

        return result;
      } catch (error) {
        this.metrics.incrementCounter("payment_attempt_failed", {
          attempt,
          error: error instanceof Error ? error.constructor.name : "unknown",
        });

        const isLastAttempt = attempt === retryConfig.maxAttempts;
        const isRetryableError = this.isRetryableError(error);

        if (isLastAttempt || !isRetryableError) {
          throw error;
        }

        // Exponential backoff with jitter
        const delay = Math.min(
          retryConfig.baseDelay *
            Math.pow(retryConfig.exponentialBase, attempt - 1),
          retryConfig.maxDelay
        );
        const jitter = delay * 0.1 * Math.random(); // 10% jitter

        await this.sleep(delay + jitter);
      }
    }

    throw new Error("Payment processing failed after all retries");
  }

  private async processFallbackPayment(
    paymentData: PaymentRequest,
    operationId: string
  ): Promise<PaymentResult> {
    this.metrics.incrementCounter("fallback_payment_used");

    // Use secondary payment processor
    const fallbackProcessor = this.loadBalancer.selectHealthyNode(
      "fallback_payment_service"
    );

    const result = await fallbackProcessor.processPayment({
      ...paymentData,
      operationId,
      isFallback: true,
    });

    // Mark this transaction for reconciliation later
    await this.markForReconciliation(paymentData, result, operationId);

    return result;
  }

  private async storeTransactionAsync(
    paymentData: PaymentRequest,
    result: PaymentResult,
    operationId: string
  ): Promise<void> {
    // Fire-and-forget transaction storage with error handling
    setImmediate(async () => {
      try {
        await this.databaseCluster.executeOnPrimary(
          `INSERT INTO transactions (
            id, user_id, amount, status, payment_method, 
            external_id, created_at, operation_id
          ) VALUES (?, ?, ?, ?, ?, ?, NOW(), ?)`,
          [
            result.transactionId,
            paymentData.userId,
            paymentData.amount,
            result.status,
            paymentData.paymentMethod,
            result.externalTransactionId,
            operationId,
          ]
        );
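      } catch (error) {
        // If the database write fails, queue the record for retry. Guard the
        // queue call too: a rejection inside this detached callback would
        // otherwise surface as an unhandled promise rejection.
        try {
          await this.queueTransactionForRetry(paymentData, result, operationId);
        } catch (queueError) {
          console.error(
            `Could not persist or queue transaction ${operationId}:`,
            queueError
          );
        }
      }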
    });
  }

  private async updateCacheAsync(
    key: string,
    value: any,
    ttl: number
  ): Promise<void> {
    setImmediate(async () => {
      try {
        await this.cacheCluster.set(key, value, ttl);
      } catch (error) {
        // Cache update failures are not critical - just log
        console.warn(`Failed to update cache for key ${key}:`, error);
      }
    });
  }

  private isRetryableError(error: Error): boolean {
    // Define which errors are worth retrying
    return (
      error instanceof NetworkTimeoutError ||
      error instanceof ConnectionRefusedError ||
      error instanceof TemporaryServiceError ||
      (error instanceof HTTPError && error.statusCode >= 500)
    );
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

// Circuit breaker pattern for external service protection
export class CircuitBreaker {
  private state: CircuitBreakerState = CircuitBreakerState.CLOSED;
  private failureCount = 0;
  private lastFailureTime: Date | null = null;
  private successCount = 0;

  constructor(
    private config: CircuitBreakerConfig,
    private metrics: MetricsCollector
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (!this.canExecute()) {
      throw new CircuitBreakerOpenError("Circuit breaker is OPEN");
    }

    const startTime = Date.now();

    try {
      const result = await operation();
      this.onSuccess();

      this.metrics.recordExecutionTime(
        `circuit_breaker.${this.config.name}`,
        Date.now() - startTime
      );

      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  canExecute(): boolean {
    switch (this.state) {
      case CircuitBreakerState.CLOSED:
        return true;

      case CircuitBreakerState.OPEN:
        // Check if we should attempt to recover
        if (this.shouldAttemptRecovery()) {
          this.state = CircuitBreakerState.HALF_OPEN;
          this.metrics.incrementCounter(
            `circuit_breaker.${this.config.name}.half_open`
          );
          return true;
        }
        return false;

      case CircuitBreakerState.HALF_OPEN:
        return true;

      default:
        return false;
    }
  }

  private onSuccess(): void {
    this.successCount++;

    if (this.state === CircuitBreakerState.HALF_OPEN) {
      if (this.successCount >= this.config.recoverySuccessThreshold) {
        this.state = CircuitBreakerState.CLOSED;
        this.failureCount = 0;
        this.successCount = 0;
        this.metrics.incrementCounter(
          `circuit_breaker.${this.config.name}.closed`
        );
      }
    } else if (this.state === CircuitBreakerState.CLOSED) {
      // Reset failure count on successful operation
      this.failureCount = 0;
    }
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    this.successCount = 0; // Reset success count

    if (
      this.state === CircuitBreakerState.CLOSED ||
      this.state === CircuitBreakerState.HALF_OPEN
    ) {
      if (this.failureCount >= this.config.failureThreshold) {
        this.state = CircuitBreakerState.OPEN;
        this.metrics.incrementCounter(
          `circuit_breaker.${this.config.name}.open`
        );
      }
    }
  }

  private shouldAttemptRecovery(): boolean {
    if (!this.lastFailureTime) return false;

    const timeSinceLastFailure = Date.now() - this.lastFailureTime.getTime();
    return timeSinceLastFailure >= this.config.recoveryTimeoutMs;
  }

  getState(): CircuitBreakerState {
    return this.state;
  }

  getStats(): CircuitBreakerStats {
    return {
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      lastFailureTime: this.lastFailureTime,
    };
  }
}

// Database cluster with automatic failover
export class DatabaseCluster {
  private primaryNode!: DatabaseNode; // assigned in initializeCluster()
  private readReplicas: DatabaseNode[] = [];
  private isFailoverInProgress = false;

  constructor(
    private config: DatabaseClusterConfig,
    private healthChecker: DatabaseHealthChecker,
    private metrics: MetricsCollector
  ) {
    this.initializeCluster();
    this.startHealthMonitoring();
  }

  async executeOnPrimary(query: string, params: any[] = []): Promise<any> {
    if (!this.primaryNode || !this.primaryNode.isHealthy()) {
      // Attempt failover to a healthy read replica
      await this.attemptFailover();
    }

    if (!this.primaryNode || !this.primaryNode.isHealthy()) {
      throw new DatabaseUnavailableError(
        "No healthy primary database available"
      );
    }

    return await this.primaryNode.execute(query, params);
  }

  async executeOnReadReplica(query: string, params: any[] = []): Promise<any> {
    const healthyReplicas = this.readReplicas.filter((replica) =>
      replica.isHealthy()
    );

    if (healthyReplicas.length === 0) {
      // Fall back to primary if no read replicas are available
      return await this.executeOnPrimary(query, params);
    }

    // Select replica with lowest current load
    const selectedReplica = healthyReplicas.reduce((best, current) => {
      return current.getCurrentLoad() < best.getCurrentLoad() ? current : best;
    });

    try {
      return await selectedReplica.execute(query, params);
    } catch (error) {
      // If read replica fails, try primary as fallback
      this.metrics.incrementCounter("read_replica_failover");
      return await this.executeOnPrimary(query, params);
    }
  }

  private async attemptFailover(): Promise<void> {
    if (this.isFailoverInProgress) {
      // Wait for current failover to complete
      return await this.waitForFailoverCompletion();
    }

    this.isFailoverInProgress = true;

    try {
      console.log("Starting database failover process...");
      this.metrics.incrementCounter("database_failover_started");

      // Find the most up-to-date read replica
      const bestReplica = await this.selectFailoverCandidate();

      if (!bestReplica) {
        throw new FailoverError("No suitable failover candidate found");
      }

      // Promote read replica to primary
      await this.promoteReplicaToPrimary(bestReplica);

      // Update application configuration
      this.primaryNode = bestReplica;

      // Remove promoted node from read replica list
      this.readReplicas = this.readReplicas.filter(
        (replica) => replica !== bestReplica
      );

      console.log(
        `Database failover completed. New primary: ${bestReplica.nodeId}`
      );
      this.metrics.incrementCounter("database_failover_success");
    } catch (error) {
      console.error("Database failover failed:", error);
      this.metrics.incrementCounter("database_failover_failed");
      throw error;
    } finally {
      this.isFailoverInProgress = false;
    }
  }

  private async selectFailoverCandidate(): Promise<DatabaseNode | null> {
    const candidates = this.readReplicas.filter((replica) =>
      replica.isHealthy()
    );

    if (candidates.length === 0) {
      return null;
    }

    // Select replica with least lag (most up-to-date data)
    const candidateMetrics = await Promise.all(
      candidates.map(async (candidate) => ({
        node: candidate,
        lag: await candidate.getReplicationLag(),
        load: candidate.getCurrentLoad(),
      }))
    );

    // Sort by lag (ascending) then by load (ascending)
    candidateMetrics.sort((a, b) => {
      if (a.lag !== b.lag) return a.lag - b.lag;
      return a.load - b.load;
    });

    return candidateMetrics[0].node;
  }

  private async promoteReplicaToPrimary(replica: DatabaseNode): Promise<void> {
    // Promote the replica to accept writes
    await replica.promoteToMaster();

    // Update DNS or load balancer configuration
    await this.updateRoutingConfiguration(replica);

    // Verify promotion was successful
    const isWritable = await replica.testWriteAccess();
    if (!isWritable) {
      throw new FailoverError("Failed to promote replica to writable primary");
    }
  }

  private async updateRoutingConfiguration(
    newPrimary: DatabaseNode
  ): Promise<void> {
    // In production, this would update DNS records, load balancer config, etc.
    // For this example, we'll simulate the configuration update
    await new Promise((resolve) => setTimeout(resolve, 2000)); // 2 second delay
  }

  private async waitForFailoverCompletion(): Promise<void> {
    let attempts = 0;
    const maxAttempts = 30; // 30 seconds max wait

    while (this.isFailoverInProgress && attempts < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, 1000));
      attempts++;
    }

    if (this.isFailoverInProgress) {
      throw new FailoverTimeoutError("Failover process timed out");
    }
  }

  private initializeCluster(): void {
    // Initialize database nodes based on configuration
    this.primaryNode = new DatabaseNode(
      this.config.primary.nodeId,
      this.config.primary.connectionString,
      "primary"
    );

    this.readReplicas = this.config.readReplicas.map(
      (replicaConfig) =>
        new DatabaseNode(
          replicaConfig.nodeId,
          replicaConfig.connectionString,
          "replica"
        )
    );
  }

  private startHealthMonitoring(): void {
    // Monitor primary node
    this.healthChecker.monitor(this.primaryNode, {
      interval: 10000, // Check every 10 seconds
      onHealthChange: (node, isHealthy) => {
        if (!isHealthy && node === this.primaryNode) {
          console.warn(`Primary database node ${node.nodeId} became unhealthy`);
          // Trigger failover in next operation
        }
      },
    });

    // Monitor read replicas
    this.readReplicas.forEach((replica) => {
      this.healthChecker.monitor(replica, {
        interval: 15000, // Check every 15 seconds
        onHealthChange: (node, isHealthy) => {
          if (!isHealthy) {
            console.warn(`Read replica ${node.nodeId} became unhealthy`);
          } else {
            console.log(`Read replica ${node.nodeId} recovered`);
          }
        },
      });
    });
  }

  async getClusterStatus(): Promise<DatabaseClusterStatus> {
    // getReplicationLag() is asynchronous (see selectFailoverCandidate), so
    // resolve every replica's lag before building the status object
    const readReplicas = await Promise.all(
      this.readReplicas.map(async (replica) => ({
        nodeId: replica.nodeId,
        healthy: replica.isHealthy(),
        load: replica.getCurrentLoad(),
        lagMs: await replica.getReplicationLag(),
      }))
    );

    return {
      primaryNode: {
        nodeId: this.primaryNode?.nodeId || "none",
        healthy: this.primaryNode?.isHealthy() || false,
        load: this.primaryNode?.getCurrentLoad() || 0,
      },
      readReplicas,
      isFailoverInProgress: this.isFailoverInProgress,
    };
  }
}

// Load balancer with intelligent routing
export class LoadBalancer {
  private serviceNodes = new Map<string, ServiceNode[]>();
  private routingAlgorithm: RoutingAlgorithm;

  constructor(
    private config: LoadBalancerConfig,
    private healthChecker: ServiceHealthChecker
  ) {
    this.routingAlgorithm = this.createRoutingAlgorithm(config.algorithm);
    this.initializeServices();
  }

  selectHealthyNode(serviceName: string): ServiceNode {
    const nodes = this.serviceNodes.get(serviceName);
    if (!nodes) {
      throw new ServiceNotFoundError(`Service ${serviceName} not configured`);
    }

    const healthyNodes = nodes.filter((node) => node.isHealthy());
    if (healthyNodes.length === 0) {
      throw new NoHealthyNodesError(
        `No healthy nodes available for service ${serviceName}`
      );
    }

    return this.routingAlgorithm.selectNode(healthyNodes);
  }

  getAllHealthyNodes(serviceName: string): ServiceNode[] {
    const nodes = this.serviceNodes.get(serviceName);
    if (!nodes) {
      throw new ServiceNotFoundError(`Service ${serviceName} not configured`);
    }

    return nodes.filter((node) => node.isHealthy());
  }

  async executeWithLoadBalancing<T>(
    serviceName: string,
    operation: (node: ServiceNode) => Promise<T>,
    retries: number = 2
  ): Promise<T> {
    let lastError: Error | null = null;

    for (let attempt = 0; attempt <= retries; attempt++) {
      try {
        const node = this.selectHealthyNode(serviceName);
        return await operation(node);
      } catch (error) {
        lastError = error as Error;

        if (attempt < retries) {
          // Brief delay before retry
          await new Promise((resolve) =>
            setTimeout(resolve, 100 * (attempt + 1))
          );
        }
      }
    }

    throw lastError || new Error("Load balancing operation failed");
  }

  private createRoutingAlgorithm(algorithm: string): RoutingAlgorithm {
    switch (algorithm) {
      case "round_robin":
        return new RoundRobinRouting();
      case "least_connections":
        return new LeastConnectionsRouting();
      case "weighted_response_time":
        return new WeightedResponseTimeRouting();
      case "resource_based":
        return new ResourceBasedRouting();
      default:
        throw new Error(`Unknown routing algorithm: ${algorithm}`);
    }
  }

  private initializeServices(): void {
    for (const [serviceName, serviceConfig] of Object.entries(
      this.config.services
    )) {
      const nodes = serviceConfig.nodes.map(
        (nodeConfig) =>
          new ServiceNode(
            nodeConfig.nodeId,
            nodeConfig.endpoint,
            serviceConfig.healthCheck
          )
      );

      this.serviceNodes.set(serviceName, nodes);

      // Start health checking for all nodes
      nodes.forEach((node) => {
        this.healthChecker.monitor(node, {
          interval: serviceConfig.healthCheck.interval,
          timeout: serviceConfig.healthCheck.timeout,
          onHealthChange: (node, isHealthy) => {
            console.log(
              `Service node ${node.nodeId} health changed: ${
                isHealthy ? "healthy" : "unhealthy"
              }`
            );
          },
        });
      });
    }
  }
}

// Supporting types and interfaces
export enum CircuitBreakerState {
  CLOSED = "closed",
  OPEN = "open",
  HALF_OPEN = "half_open",
}

export interface CircuitBreakerConfig {
  name: string;
  failureThreshold: number;
  recoveryTimeoutMs: number;
  recoverySuccessThreshold: number;
}

export interface CircuitBreakerStats {
  state: CircuitBreakerState;
  failureCount: number;
  successCount: number;
  lastFailureTime: Date | null;
}

export interface DatabaseClusterConfig {
  primary: {
    nodeId: string;
    connectionString: string;
  };
  readReplicas: Array<{
    nodeId: string;
    connectionString: string;
  }>;
}

export interface DatabaseClusterStatus {
  primaryNode: {
    nodeId: string;
    healthy: boolean;
    load: number;
  };
  readReplicas: Array<{
    nodeId: string;
    healthy: boolean;
    load: number;
    lagMs: number;
  }>;
  isFailoverInProgress: boolean;
}

export class CircuitBreakerOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CircuitBreakerOpenError";
  }
}

export class DatabaseUnavailableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "DatabaseUnavailableError";
  }
}

export class FailoverError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FailoverError";
  }
}

export class FailoverTimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FailoverTimeoutError";
  }
}
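The exponential backoff inside executePaymentWithRetry can be pulled out into a pure function, which makes the delay schedule easy to inspect and test. This is a sketch under assumptions: backoffDelay and its injectable random parameter are hypothetical additions, while the configuration values are the ones used in the listing above.

```typescript
interface RetryConfig {
  maxAttempts: number;
  baseDelay: number; // milliseconds
  maxDelay: number; // milliseconds
  exponentialBase: number;
}

// Delay to wait after a failed attempt (1-based), mirroring the retry loop:
// exponential growth, capped at maxDelay, plus up to 10% random jitter.
// `random` is injectable so the schedule can be checked deterministically.
function backoffDelay(
  attempt: number,
  config: RetryConfig,
  random: () => number = Math.random
): number {
  const exponential =
    config.baseDelay * Math.pow(config.exponentialBase, attempt - 1);
  const capped = Math.min(exponential, config.maxDelay);
  const jitter = capped * 0.1 * random();
  return capped + jitter;
}

const retryConfig: RetryConfig = {
  maxAttempts: 3,
  baseDelay: 1000,
  maxDelay: 10000,
  exponentialBase: 2,
};

// With jitter switched off, attempts 1 to 5 wait 1000, 2000, 4000, 8000, 10000 ms
for (let attempt = 1; attempt <= 5; attempt++) {
  console.log(`attempt ${attempt}: ${backoffDelay(attempt, retryConfig, () => 0)} ms`);
}
```

With maxAttempts set to 3, only the first two delays (about 1 and 2 seconds) are ever used; the cap only matters once the attempt count grows. Because jitter is added after capping, a delay can exceed maxDelay by up to 10%.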

Disaster Recovery Planning: When Everything Goes Wrong

The Problem: Disaster Recovery as an Afterthought

// The "disaster recovery" plan that guarantees disasters
class DisasterProneService {
  private database: Database;
  private fileStorage: FileStorage;

  constructor() {
    this.database = new Database({
      host: "primary-db.us-east-1.amazonaws.com",
      // Single region, single AZ - RED FLAG #1
    });

    this.fileStorage = new FileStorage({
      bucket: "app-data-primary",
      region: "us-east-1",
      // No cross-region replication - RED FLAG #2
    });
  }

  // "Backup" strategy
  async createBackup(): Promise<void> {
    try {
      // Manual backup process - RED FLAG #3
      // What happens if this fails at 3 AM?
      const backupFile = `backup_${new Date().toISOString()}.sql`;

      // Backup to the same region as primary - RED FLAG #4
      // Hurricane knocks out entire region = no backup
      await this.database.dumpToFile(`/backups/${backupFile}`);

      console.log(`Backup created: ${backupFile}`);

      // No verification that backup actually works - RED FLAG #5
      // How do we know this backup can actually be restored?

      // No automation - RED FLAG #6
      // Relies on someone remembering to run this
    } catch (error) {
      // No backup failure alerts - RED FLAG #7
      console.error("Backup failed:", error);
      // Problem? What problem? 🔥
    }
  }

  // "Recovery" strategy
  async recoverFromDisaster(): Promise<void> {
    try {
      // Manual recovery process - RED FLAG #8
      // Hope you remember the exact steps under pressure

      console.log("Starting disaster recovery...");
      console.log("Step 1: Find the latest backup (pray it exists)");
      console.log(
        "Step 2: Spin up new infrastructure (hopefully we wrote down how)"
      );
      console.log("Step 3: Restore database (cross fingers)");
      console.log("Step 4: Update DNS (wait 24-48 hours for propagation)");
      console.log("Step 5: Hope customers don't notice the data loss");

      // No RTO/RPO planning - RED FLAG #9
      // How long until we're back online? ¯\_(ツ)_/¯
      // How much data did we lose? ¯\_(ツ)_/¯
    } catch (error) {
      // This is fine 🔥
      throw new Error("Recovery failed. Update your resume.");
    }
  }

  // What actually happens during a disaster:
  // 1. Primary region goes down at 2 AM
  // 2. On-call engineer wakes up to 500 alerts
  // 3. Discover last backup was 3 days ago (and corrupted)
  // 4. Spend 8 hours trying to rebuild infrastructure from memory
  // 5. Finally get system online, missing 72 hours of customer data
  // 6. Customers flee to competitors while you're down
  // 7. Executives schedule "post-mortem" meetings
}
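Red flag #9 is worth making concrete. RPO (recovery point objective) is the maximum window of data loss you can tolerate, and RTO (recovery time objective) is the maximum time you can take to restore service. The sketch below compares a plan against both. Every name and the target values (15 minutes RPO, 60 minutes RTO) are hypothetical, while the 72-hour backup gap and 8-hour rebuild come from the scenario in the listing above.

```typescript
interface RecoveryObjectives {
  rpoMinutes: number; // maximum tolerable window of lost data
  rtoMinutes: number; // maximum tolerable time until service is restored
}

interface RecoveryPlanEstimate {
  backupIntervalMinutes: number; // gap between successful backups
  estimatedRestoreMinutes: number; // ideally measured in a restore drill
}

// Return a human-readable list of objectives the plan cannot meet
function findViolations(
  estimate: RecoveryPlanEstimate,
  objectives: RecoveryObjectives
): string[] {
  const violations: string[] = [];

  // Worst case: the failure hits just before the next backup would have run
  if (estimate.backupIntervalMinutes > objectives.rpoMinutes) {
    violations.push(
      `RPO missed: up to ${estimate.backupIntervalMinutes} min of data at risk, target is ${objectives.rpoMinutes} min`
    );
  }

  if (estimate.estimatedRestoreMinutes > objectives.rtoMinutes) {
    violations.push(
      `RTO missed: restore takes about ${estimate.estimatedRestoreMinutes} min, target is ${objectives.rtoMinutes} min`
    );
  }

  return violations;
}

const targets: RecoveryObjectives = { rpoMinutes: 15, rtoMinutes: 60 };

const disasterProne: RecoveryPlanEstimate = {
  backupIntervalMinutes: 72 * 60, // last backup was 3 days old
  estimatedRestoreMinutes: 8 * 60, // 8 hours rebuilding from memory
};

console.log(findViolations(disasterProne, targets));
```

Running a check like this in CI or as a scheduled job turns the shrug emoji into a number: the plan above misses both targets by orders of magnitude, which is exactly what the automated approach in the next section aims to fix.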

The Solution: Automated Disaster Recovery Architecture

// Enterprise disaster recovery with automatic failover
export class DisasterRecoveryOrchestrator {
  constructor(
    private primaryRegion: RegionManager,
    private secondaryRegion: RegionManager,
    private dnsFailover: DNSFailoverManager,
    private backupManager: BackupManager,
    private replicationManager: ReplicationManager,
    private monitoringSystem: DisasterMonitoringSystem,
    private runbookExecutor: RunbookExecutor
  ) {
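    // Constructors cannot await, so surface initialization failures here
    // instead of leaving an unhandled promise rejection
    this.initializeDisasterRecovery().catch((error) => {
      console.error("Disaster recovery initialization failed:", error);
    });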
  }

  private async initializeDisasterRecovery(): Promise<void> {
    // Set up continuous monitoring
    await this.setupDisasterMonitoring();

    // Initialize cross-region replication
    await this.replicationManager.initialize();

    // Schedule automated testing
    await this.scheduleDisasterRecoveryTests();

    console.log("Disaster recovery system initialized");
  }

  async handleRegionFailure(
    failedRegion: string,
    failureType: FailureType
  ): Promise<DisasterRecoveryResult> {
    const recoveryStartTime = Date.now();
    const incidentId = uuidv4();

    console.log(
      `🚨 DISASTER DETECTED: ${failureType} in region ${failedRegion}`
    );
    console.log(
      `🔄 Starting automated disaster recovery (Incident: ${incidentId})`
    );

    try {
      // Step 1: Assess damage and determine recovery strategy
      const assessmentResult = await this.assessDisasterScope(
        failedRegion,
        failureType
      );

      // Step 2: Execute appropriate recovery plan
      const recoveryPlan = this.selectRecoveryPlan(assessmentResult);
      const recoveryResult = await this.executeRecoveryPlan(
        recoveryPlan,
        incidentId
      );

      // Step 3: Verify recovery success
      await this.verifyRecoverySuccess(recoveryResult);

      // Step 4: Update monitoring and alerting
      await this.updatePostRecoveryMonitoring(recoveryResult);

      const totalRecoveryTime = Date.now() - recoveryStartTime;

      console.log(`✅ Disaster recovery completed in ${totalRecoveryTime}ms`);

      return new DisasterRecoveryResult(
        incidentId,
        failureType,
        recoveryResult.rtoAchieved,
        recoveryResult.rpoAchieved,
        totalRecoveryTime,
        "SUCCESS"
      );
    } catch (error) {
      console.error(
        `❌ Disaster recovery failed for incident ${incidentId}:`,
        error
      );

      // Escalate to human intervention
      await this.escalateToHumanIntervention(incidentId, failedRegion, error);

      throw new DisasterRecoveryFailedError(
        `Disaster recovery failed for incident ${incidentId}: ${(error as Error).message}`
      );
    }
  }

  private async assessDisasterScope(
    failedRegion: string,
    failureType: FailureType
  ): Promise<DisasterAssessment> {
    const assessment = new DisasterAssessment(failedRegion, failureType);

    // Check what services are affected
    const services = await this.primaryRegion.getAllServices();

    for (const service of services) {
      const serviceHealth = await this.checkServiceHealth(
        service,
        failedRegion
      );
      assessment.addServiceStatus(service.name, serviceHealth);
    }

    // Assess data consistency across regions
    const dataConsistency = await this.assessCrossRegionDataConsistency();
    assessment.setDataConsistency(dataConsistency);

    // Determine if this is a partial or complete regional failure
    assessment.determineFailureScope();

    return assessment;
  }

  private selectRecoveryPlan(assessment: DisasterAssessment): RecoveryPlan {
    switch (assessment.failureScope) {
      case FailureScope.SERVICE_LEVEL:
        return new ServiceLevelRecoveryPlan(assessment);

      case FailureScope.AVAILABILITY_ZONE:
        return new AvailabilityZoneRecoveryPlan(assessment);

      case FailureScope.REGION_LEVEL:
        return new RegionLevelRecoveryPlan(assessment);

      case FailureScope.MULTI_REGION:
        return new MultiRegionRecoveryPlan(assessment);

      default:
        throw new Error(`Unknown failure scope: ${assessment.failureScope}`);
    }
  }

  private async executeRecoveryPlan(
    plan: RecoveryPlan,
    incidentId: string
  ): Promise<RecoveryExecutionResult> {
    const execution = new RecoveryExecution(plan, incidentId);

    // Execute recovery steps in order; each step is verified before the next one starts
    const recoverySteps = plan.getExecutionSteps();

    for (const step of recoverySteps) {
      console.log(`🔧 Executing recovery step: ${step.description}`);

      try {
        const stepResult = await this.executeRecoveryStep(step);
        execution.recordStepResult(step, stepResult);

        // Verify step succeeded before continuing
        if (!stepResult.success) {
          throw new RecoveryStepFailedError(
            `Recovery step failed: ${step.description} - ${stepResult.error}`
          );
        }
      } catch (error) {
        execution.recordStepError(step, error as Error);

        // Check if this step is critical for recovery
        if (step.isCritical) {
          throw error;
        } else {
          console.warn(
            `⚠️ Non-critical recovery step failed: ${step.description}`
          );
        }
      }
    }

    return execution.getResult();
  }

  private async executeRecoveryStep(
    step: RecoveryStep
  ): Promise<RecoveryStepResult> {
    switch (step.type) {
      case RecoveryStepType.DNS_FAILOVER:
        return await this.executeDNSFailover(step as DNSFailoverStep);

      case RecoveryStepType.DATABASE_FAILOVER:
        return await this.executeDatabaseFailover(step as DatabaseFailoverStep);

      case RecoveryStepType.APPLICATION_RESTART:
        return await this.executeApplicationRestart(
          step as ApplicationRestartStep
        );

      case RecoveryStepType.TRAFFIC_ROUTING:
        return await this.executeTrafficRouting(step as TrafficRoutingStep);

      case RecoveryStepType.DATA_SYNCHRONIZATION:
        return await this.executeDataSynchronization(
          step as DataSynchronizationStep
        );

      default:
        throw new Error(`Unknown recovery step type: ${step.type}`);
    }
  }

  private async executeDNSFailover(
    step: DNSFailoverStep
  ): Promise<RecoveryStepResult> {
    try {
      // Update DNS records to point to secondary region
      await this.dnsFailover.updateRecords({
        domain: step.domain,
        oldEndpoint: step.primaryEndpoint,
        newEndpoint: step.secondaryEndpoint,
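        // A short TTL only helps if it was already in place before the outage:
        // resolvers keep the old record until its previous TTL expires
        ttl: step.emergencyTTL,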
      });

      // Verify DNS propagation
      const propagationStatus = await this.dnsFailover.verifyPropagation(
        step.domain,
        step.secondaryEndpoint,
        step.maxPropagationTimeMs
      );

      if (!propagationStatus.isComplete) {
        return new RecoveryStepResult(
          false,
          `DNS propagation incomplete: ${propagationStatus.percentage}%`
        );
      }

      return new RecoveryStepResult(
        true,
        "DNS failover completed successfully"
      );
    } catch (error) {
      return new RecoveryStepResult(
        false,
        `DNS failover failed: ${(error as Error).message}`
      );
    }
  }

  private async executeDatabaseFailover(
    step: DatabaseFailoverStep
  ): Promise<RecoveryStepResult> {
    try {
      // Promote read replica to primary in secondary region
      const promotionResult = await this.replicationManager.promoteReplica(
        step.sourceDatabase,
        step.targetDatabase
      );

      if (!promotionResult.success) {
        return new RecoveryStepResult(
          false,
          `Database promotion failed: ${promotionResult.error}`
        );
      }

      // Verify new primary is accepting writes
      const writeTest = await this.verifyDatabaseWriteAccess(
        step.targetDatabase
      );
      if (!writeTest) {
        return new RecoveryStepResult(
          false,
          "New primary database is not accepting writes"
        );
      }

      // Update application configuration to use new primary
      await this.updateDatabaseConfiguration(step.targetDatabase);

      return new RecoveryStepResult(
        true,
        `Database failover completed. RPO: ${promotionResult.dataLossMs}ms`
      );
    } catch (error) {
      return new RecoveryStepResult(
        false,
        `Database failover failed: ${(error as Error).message}`
      );
    }
  }

  async testDisasterRecovery(
    scenario: DisasterRecoveryScenario
  ): Promise<DisasterRecoveryTestResult> {
    console.log(`🧪 Starting disaster recovery test: ${scenario.name}`);

    const testStartTime = Date.now();
    const testId = uuidv4();

    try {
      // Create isolated test environment
      const testEnvironment = await this.createTestEnvironment(scenario);

      // Simulate disaster
      await this.simulateDisaster(testEnvironment, scenario.disasterType);

      // Execute recovery
      const recoveryResult = await this.handleRegionFailure(
        testEnvironment.region,
        scenario.disasterType
      );

      // Verify recovery
      const verificationResult = await this.verifyTestRecovery(
        testEnvironment,
        scenario
      );

      // Cleanup test environment
      await this.cleanupTestEnvironment(testEnvironment);

      const totalTestTime = Date.now() - testStartTime;

      return new DisasterRecoveryTestResult(
        testId,
        scenario.name,
        true,
        recoveryResult.rtoAchieved,
        recoveryResult.rpoAchieved,
        totalTestTime,
        verificationResult
      );
    } catch (error) {
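      // Catch variables are `unknown` under strict TypeScript, so narrow first
      const message = error instanceof Error ? error.message : String(error);
      console.error(`❌ Disaster recovery test failed: ${message}`);

      return new DisasterRecoveryTestResult(
        testId,
        scenario.name,
        false,
        0,
        0,
        Date.now() - testStartTime,
        null,
        message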
      );
    }
  }

  private async setupDisasterMonitoring(): Promise<void> {
    // Monitor region health
    this.monitoringSystem.addHealthCheck("region_connectivity", {
      interval: 30000, // Check every 30 seconds
      timeout: 10000,
      check: async () => {
        const primaryHealth = await this.primaryRegion.healthCheck();
        const secondaryHealth = await this.secondaryRegion.healthCheck();

        return {
          primary: primaryHealth,
          secondary: secondaryHealth,
          crossRegionLatency: await this.measureCrossRegionLatency(),
        };
      },
      onFailure: async (result) => {
        if (!result.primary.healthy) {
          await this.handleRegionFailure(
            this.primaryRegion.regionId,
            FailureType.REGION_OUTAGE
          );
        }
      },
    });

    // Monitor replication lag
    this.monitoringSystem.addHealthCheck("replication_lag", {
      interval: 60000, // Check every minute
      check: async () => {
        return await this.replicationManager.getReplicationStatus();
      },
      onFailure: async (status) => {
        if (status.lagMs > 30000) {
          // 30 second lag threshold
          console.warn(`⚠️ High replication lag detected: ${status.lagMs}ms`);
          // Trigger alert but don't failover yet
        }
      },
    });
  }

  private async scheduleDisasterRecoveryTests(): Promise<void> {
    // Schedule monthly full disaster recovery tests
    const monthlyTest = new DisasterRecoveryScenario(
      "monthly_full_region_failover",
      FailureType.REGION_OUTAGE,
      {
        rtoTarget: 300000, // 5 minutes
        rpoTarget: 60000, // 1 minute
        servicesInScope: ["all"],
      }
    );

    // Schedule weekly partial tests
    const weeklyTest = new DisasterRecoveryScenario(
      "weekly_service_failover",
      FailureType.SERVICE_OUTAGE,
      {
        rtoTarget: 120000, // 2 minutes
        rpoTarget: 10000, // 10 seconds
        servicesInScope: ["api", "database"],
      }
    );

    // This sketch only constructs the scenarios. In production, monthlyTest and
    // weeklyTest would be handed to a proper scheduler that calls testDisasterRecovery
    console.log("Disaster recovery tests scheduled");
  }

  async generateDisasterRecoveryReport(): Promise<DisasterRecoveryReport> {
    const backupStatus = await this.backupManager.getBackupStatus();
    const replicationStatus =
      await this.replicationManager.getReplicationStatus();
    const testHistory = await this.getTestHistory();

    return new DisasterRecoveryReport(
      backupStatus,
      replicationStatus,
      testHistory,
      this.calculateRecoveryCapabilities()
    );
  }

  private calculateRecoveryCapabilities(): RecoveryCapabilities {
    // Placeholder values for illustration. A real implementation would derive
    // RTO/RPO from measured replication lag, backup frequency, and DR test history
    return {
      estimatedRTO: 180000, // 3 minutes
      estimatedRPO: 30000, // 30 seconds
      automaticFailoverEnabled: true,
      crossRegionReplicationEnabled: true,
      backupFrequency: "15_minutes",
      lastSuccessfulTest: new Date(),
      confidenceLevel: 0.95,
    };
  }
}
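
The `region_connectivity` check above starts a full failover as soon as one probe reports the primary unhealthy, so a single dropped health check could trigger a region switch. One common safeguard is to require several consecutive failures first. The class below is a minimal sketch of that idea; `ConsecutiveFailureDetector` and its threshold of three are assumptions for illustration, not part of the orchestrator.

```typescript
// Hypothetical helper: counts back-to-back failed probes
class ConsecutiveFailureDetector {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    if (!Number.isInteger(threshold) || threshold < 1) {
      throw new RangeError("threshold must be a positive integer");
    }
    this.threshold = threshold;
  }

  // Feed in one probe result. Returns true only on the probe that reaches
  // the threshold, so the caller starts failover once per outage.
  record(healthy: boolean): boolean {
    if (healthy) {
      this.consecutiveFailures = 0; // any success resets the streak
      return false;
    }
    this.consecutiveFailures += 1;
    return this.consecutiveFailures === this.threshold;
  }
}

const detector = new ConsecutiveFailureDetector(3);
const probes = [false, true, false, false, false, false];
const decisions = probes.map((healthy) => detector.record(healthy));
console.log(decisions); // [false, false, false, false, true, false]
```

With the 30 second interval used above, a threshold of three would delay failover by about a minute in exchange for ignoring one-off blips. That delay counts against the RTO, so the threshold is a trade-off rather than a free win.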

// Supporting types for disaster recovery
export enum FailureType {
  SERVICE_OUTAGE = "service_outage",
  AVAILABILITY_ZONE_FAILURE = "az_failure",
  REGION_OUTAGE = "region_outage",
  NETWORK_PARTITION = "network_partition",
  DATA_CORRUPTION = "data_corruption",
}

export enum FailureScope {
  SERVICE_LEVEL = "service",
  AVAILABILITY_ZONE = "az",
  REGION_LEVEL = "region",
  MULTI_REGION = "multi_region",
}

export enum RecoveryStepType {
  DNS_FAILOVER = "dns_failover",
  DATABASE_FAILOVER = "database_failover",
  APPLICATION_RESTART = "application_restart",
  TRAFFIC_ROUTING = "traffic_routing",
  DATA_SYNCHRONIZATION = "data_sync",
}

export class DisasterRecoveryResult {
  constructor(
    public incidentId: string,
    public failureType: FailureType,
    public rtoAchieved: number,
    public rpoAchieved: number,
    public totalRecoveryTime: number,
    public status: "SUCCESS" | "FAILED"
  ) {}
}

export class DisasterRecoveryTestResult {
  constructor(
    public testId: string,
    public scenarioName: string,
    public success: boolean,
    public rtoAchieved: number,
    public rpoAchieved: number,
    public totalTestTime: number,
    public verificationResult: any,
    public errorMessage?: string
  ) {}
}

export class DisasterRecoveryReport {
  constructor(
    public backupStatus: any,
    public replicationStatus: any,
    public testHistory: any[],
    public recoveryCapabilities: RecoveryCapabilities
  ) {}
}

export interface RecoveryCapabilities {
  estimatedRTO: number;
  estimatedRPO: number;
  automaticFailoverEnabled: boolean;
  crossRegionReplicationEnabled: boolean;
  backupFrequency: string;
  lastSuccessfulTest: Date;
  confidenceLevel: number;
}
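
The test scenarios carry `rtoTarget` and `rpoTarget`, and the results carry `rtoAchieved` and `rpoAchieved`, but nothing above compares the two. A small check like the following could close that loop. The function name `evaluateAgainstTargets` and the three interfaces are hypothetical; the field names and millisecond units follow the classes above.

```typescript
interface RecoveryTargets {
  rtoTarget: number; // maximum acceptable downtime, in ms
  rpoTarget: number; // maximum acceptable data loss, in ms
}

interface RecoveryMeasurement {
  rtoAchieved: number; // measured downtime, in ms
  rpoAchieved: number; // measured data loss, in ms
}

interface TargetEvaluation {
  passed: boolean;
  violations: string[];
}

// Compare what a recovery (or a drill) achieved against its targets
function evaluateAgainstTargets(
  measured: RecoveryMeasurement,
  targets: RecoveryTargets
): TargetEvaluation {
  const violations: string[] = [];

  if (measured.rtoAchieved > targets.rtoTarget) {
    violations.push(
      `RTO ${measured.rtoAchieved}ms exceeds target ${targets.rtoTarget}ms`
    );
  }
  if (measured.rpoAchieved > targets.rpoTarget) {
    violations.push(
      `RPO ${measured.rpoAchieved}ms exceeds target ${targets.rpoTarget}ms`
    );
  }

  return { passed: violations.length === 0, violations };
}

// Monthly scenario targets from above: 5 minute RTO, 1 minute RPO
const monthlyTargets: RecoveryTargets = { rtoTarget: 300000, rpoTarget: 60000 };
const evaluation = evaluateAgainstTargets(
  { rtoAchieved: 420000, rpoAchieved: 30000 },
  monthlyTargets
);
console.log(evaluation.passed); // false
console.log(evaluation.violations); // ["RTO 420000ms exceeds target 300000ms"]
```

A drill that recovers successfully but takes seven minutes against a five minute target is still a failed drill, and reporting it that way keeps the targets honest.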

Multi-Region Deployments: Global Resilience

The Problem: Single Region Dependency

// The single-region system that creates global outages
class SingleRegionService {
  private region = "us-east-1"; // The infamous single point of global failure

  constructor() {
    console.log(
      `🔥 All services running in ${this.region} - what could go wrong?`
    );
  }

  async serveGlobalUsers(userLocation: string): Promise<Response> {
    // Tokyo user hitting Virginia servers - RED FLAG #1
    // ~11,000 km each way: roughly 110 ms round trip in fiber before any processing

    if (userLocation === "tokyo") {
      console.log("Tokyo user waiting 300ms for response from Virginia...");
    }

    if (userLocation === "sydney") {
      console.log("Sydney user waiting 400ms for response from Virginia...");
    }

    // All data stored in single region - RED FLAG #2
    const userData = await this.database.query(
      "SELECT * FROM users WHERE region = ?",
      [userLocation]
    );

    // No regional compliance handling - RED FLAG #3
    // Serving EU user data from US servers = GDPR violation
    if (userLocation.includes("eu-")) {
      console.log("⚖️ Potential GDPR violation: EU user data processed in US");
    }

    return new Response(userData, this.region);
  }

  // What happens during regional disasters:
  // 1. A major outage hits us-east-1 (it has happened more than once)
  // 2. Entire global user base can't access the service
  // 3. Users in Australia wait 30+ seconds for timeout
  // 4. Revenue drops to $0 worldwide until region recovers
  // 5. Competitors gain millions of users during outage
  // 6. Legal team deals with compliance violations
}
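
Red flag #1 is physics, and it can be estimated. The sketch below computes the great-circle distance with the haversine formula and turns it into a lower bound on round-trip time, assuming light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum). The coordinates are approximate values chosen for illustration, and all names are hypothetical. Real cable routes are longer than the great circle, so measured latency will be higher than this bound.

```typescript
interface LatLon {
  lat: number; // degrees
  lon: number; // degrees
}

const EARTH_RADIUS_KM = 6371;
const FIBER_KM_PER_MS = 200; // ~200,000 km/s, an approximation

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Haversine great-circle distance between two points on a sphere
function greatCircleKm(a: LatLon, b: LatLon): number {
  const sinHalfLat = Math.sin(toRadians(b.lat - a.lat) / 2);
  const sinHalfLon = Math.sin(toRadians(b.lon - a.lon) / 2);
  const h =
    sinHalfLat * sinHalfLat +
    Math.cos(toRadians(a.lat)) *
      Math.cos(toRadians(b.lat)) *
      sinHalfLon *
      sinHalfLon;
  // Clamp to 1 to guard against floating-point overshoot
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

// Lower bound on round-trip time: there and back at fiber speed
function minRoundTripMs(a: LatLon, b: LatLon): number {
  return (2 * greatCircleKm(a, b)) / FIBER_KM_PER_MS;
}

const tokyo: LatLon = { lat: 35.68, lon: 139.69 };
const northernVirginia: LatLon = { lat: 39.04, lon: -77.49 };

console.log(Math.round(greatCircleKm(tokyo, northernVirginia))); // a little under 11,000 km
console.log(Math.round(minRoundTripMs(tokyo, northernVirginia))); // roughly 110 ms
```

No amount of server tuning in Virginia removes that floor for a user in Tokyo. Only moving the serving region closer does, which is the case for the multi-region design that follows.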

The Solution: Global Multi-Region Architecture

// Enterprise multi-region architecture with intelligent routing
export class GlobalMultiRegionService {
  private regions = new Map<string, RegionManager>();
  private globalRouter: GlobalTrafficRouter;
  private dataReplication: GlobalDataReplication;
  private crossRegionMonitoring: CrossRegionMonitoring;

  constructor(
    private regionConfigs: RegionConfig[],
    private routingStrategy: RoutingStrategy = RoutingStrategy.LATENCY_BASED
  ) {
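    this.initializeRegions();
    this.setupGlobalRouting();
    this.startCrossRegionMonitoring();
    // dataReplication is used by replicateToRegion and getGlobalSystemStatus but
    // was never assigned. This assumes GlobalDataReplication takes the region
    // map, like CrossRegionMonitoring does
    this.dataReplication = new GlobalDataReplication(this.regions);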
  }

  async serveRequest(
    request: GlobalRequest,
    clientLocation: GeoLocation
  ): Promise<GlobalResponse> {
    const requestId = uuidv4();
    const startTime = Date.now();

    try {
      // Select optimal region for this request
      const optimalRegion = await this.selectOptimalRegion(
        clientLocation,
        request.requiresDataResidency
      );

      // Route request to selected region
      const response = await this.executeInRegion(
        optimalRegion,
        request,
        requestId
      );

      // Record metrics for routing optimization
      this.recordRoutingMetrics(
        clientLocation,
        optimalRegion,
        Date.now() - startTime
      );

      return new GlobalResponse(
        response.data,
        optimalRegion.regionId,
        Date.now() - startTime,
        requestId
      );
    } catch (error) {
      // Try failover to alternative region
      return await this.handleRegionalFailure(
        request,
        clientLocation,
        requestId,
        error as Error
      );
    }
  }

  private async selectOptimalRegion(
    clientLocation: GeoLocation,
    requiresDataResidency: boolean
  ): Promise<RegionManager> {
    // Filter regions based on data residency requirements
    let candidateRegions = Array.from(this.regions.values());

    if (requiresDataResidency) {
      candidateRegions = this.filterByDataResidency(
        candidateRegions,
        clientLocation
      );
    }

    // Filter out unhealthy regions
    candidateRegions = candidateRegions.filter((region) => region.isHealthy());

    if (candidateRegions.length === 0) {
      throw new NoHealthyRegionsError(
        "No healthy regions available for request"
      );
    }

    // Select based on routing strategy
    switch (this.routingStrategy) {
      case RoutingStrategy.LATENCY_BASED:
        return this.selectByLatency(candidateRegions, clientLocation);

      case RoutingStrategy.LOAD_BALANCED:
        return this.selectByLoad(candidateRegions);

      case RoutingStrategy.COST_OPTIMIZED:
        return this.selectByCost(candidateRegions, clientLocation);

      case RoutingStrategy.PERFORMANCE_OPTIMIZED:
        return this.selectByPerformance(candidateRegions, clientLocation);

      default:
        return candidateRegions[0]; // Fallback to first available
    }
  }

  private async selectByLatency(
    regions: RegionManager[],
    clientLocation: GeoLocation
  ): Promise<RegionManager> {
    // Calculate expected latency to each region
    const regionLatencies = await Promise.all(
      regions.map(async (region) => ({
        region,
        latency: await this.estimateLatency(clientLocation, region.location),
        load: region.getCurrentLoad(),
      }))
    );

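    // Group latencies into 20ms bands, then break ties within a band by load.
    // Comparing raw "within 20ms" differences is not transitive, which makes
    // the sort order unpredictable
    const band = (latency: number) => Math.floor(latency / 20);
    regionLatencies.sort((a, b) => {
      const bandDiff = band(a.latency) - band(b.latency);
      return bandDiff !== 0 ? bandDiff : a.load - b.load;
    });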

    return regionLatencies[0].region;
  }

  private async selectByLoad(regions: RegionManager[]): Promise<RegionManager> {
    // Select region with lowest current load
    return regions.reduce((best, current) => {
      const bestLoad = best.getCurrentLoad();
      const currentLoad = current.getCurrentLoad();

      // Factor in region capacity
      const bestUtilization = bestLoad / best.getMaxCapacity();
      const currentUtilization = currentLoad / current.getMaxCapacity();

      return currentUtilization < bestUtilization ? current : best;
    });
  }

  private async executeInRegion(
    region: RegionManager,
    request: GlobalRequest,
    requestId: string
  ): Promise<RegionResponse> {
    const regionStartTime = Date.now();

    try {
      // Execute request in selected region
      const response = await region.handleRequest(request, {
        requestId,
        routedFrom: "global-router",
        timestamp: new Date(),
      });

      // If this is a write operation, trigger cross-region replication
      if (request.isWrite()) {
        await this.triggerCrossRegionReplication(region, request, response);
      }

      return response;
    } catch (error) {
      const executionTime = Date.now() - regionStartTime;

      // Record regional failure
      this.crossRegionMonitoring.recordRegionFailure(
        region.regionId,
        error as Error,
        executionTime
      );

      throw error;
    }
  }

  private async handleRegionalFailure(
    request: GlobalRequest,
    clientLocation: GeoLocation,
    requestId: string,
    originalError: Error
  ): Promise<GlobalResponse> {
    console.warn(
      `Regional failure occurred, attempting failover for request ${requestId}`
    );

    try {
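      // Get backup regions (excluding the failed region)
      let backupRegions = Array.from(this.regions.values())
        .filter((region) => region.isHealthy())
        .filter((region) => !this.wasRecentlyFailed(region.regionId));

      // Failover must still honor data residency constraints
      if (request.requiresDataResidency) {
        backupRegions = this.filterByDataResidency(
          backupRegions,
          clientLocation
        );
      }

      if (backupRegions.length === 0) {
        throw new GlobalServiceUnavailableError("All regions are unavailable");
      }

      // Pick the best backup from the filtered list. Calling
      // selectOptimalRegion here could hand back the region that just failed
      const backupRegion = await this.selectByLatency(
        backupRegions,
        clientLocation
      );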

      // For read requests, try to serve from backup region
      if (!request.isWrite()) {
        const response = await this.executeInRegion(
          backupRegion,
          request,
          requestId
        );

        return new GlobalResponse(
          response.data,
          backupRegion.regionId,
          0, // Don't count failover time
          requestId,
          "FAILOVER_SUCCESS"
        );
      } else {
        // For write requests, check if we can safely execute in backup region
        const canSafelyWrite = await this.canSafelyWriteToBackup(
          request,
          backupRegion
        );

        if (canSafelyWrite) {
          const response = await this.executeInRegion(
            backupRegion,
            request,
            requestId
          );

          return new GlobalResponse(
            response.data,
            backupRegion.regionId,
            0,
            requestId,
            "FAILOVER_SUCCESS"
          );
        } else {
          // Queue write for later execution
          await this.queueWriteForLaterExecution(request, requestId);

          return new GlobalResponse(
            { queued: true },
            "queued",
            0,
            requestId,
            "QUEUED_FOR_RETRY"
          );
        }
      }
    } catch (failoverError) {
      // Both primary and failover failed
      throw new GlobalServiceUnavailableError(
        `Request failed in primary region: ${originalError.message}. ` +
          `Failover also failed: ${(failoverError as Error).message}`
      );
    }
  }

  private async triggerCrossRegionReplication(
    originRegion: RegionManager,
    request: GlobalRequest,
    response: RegionResponse
  ): Promise<void> {
    // Don't block the original request - replicate asynchronously
    setImmediate(async () => {
      try {
        const replicationTargets = Array.from(this.regions.values())
          .filter((region) => region !== originRegion)
          .filter((region) => region.isHealthy())
          .filter((region) => this.shouldReplicateTo(region, request));

        // Replicate to all target regions in parallel
        await Promise.all(
          replicationTargets.map((targetRegion) =>
            this.replicateToRegion(
              originRegion,
              targetRegion,
              request,
              response
            )
          )
        );
      } catch (error) {
        console.error("Cross-region replication failed:", error);
        // Don't fail the original request due to replication issues
      }
    });
  }

  private async replicateToRegion(
    sourceRegion: RegionManager,
    targetRegion: RegionManager,
    request: GlobalRequest,
    response: RegionResponse
  ): Promise<void> {
    const replicationData = {
      sourceRegion: sourceRegion.regionId,
      timestamp: new Date(),
      operation: request.operation,
      data: response.data,
      checksum: this.calculateChecksum(response.data),
    };

    try {
      await targetRegion.handleReplication(replicationData);

      this.crossRegionMonitoring.recordSuccessfulReplication(
        sourceRegion.regionId,
        targetRegion.regionId,
        Date.now()
      );
    } catch (error) {
      this.crossRegionMonitoring.recordFailedReplication(
        sourceRegion.regionId,
        targetRegion.regionId,
        error as Error
      );

      // Queue for retry
      await this.dataReplication.queueForRetry(
        sourceRegion.regionId,
        targetRegion.regionId,
        replicationData
      );
    }
  }

  // Data residency compliance
  private filterByDataResidency(
    regions: RegionManager[],
    clientLocation: GeoLocation
  ): RegionManager[] {
    const complianceRules = this.getDataResidencyRules(clientLocation);

    return regions.filter((region) => {
      // Check if region is compliant with data residency requirements
      for (const rule of complianceRules) {
        if (!rule.allows(region.location)) {
          return false;
        }
      }
      return true;
    });
  }

  private getDataResidencyRules(
    clientLocation: GeoLocation
  ): DataResidencyRule[] {
    const rules: DataResidencyRule[] = [];

    // GDPR compliance - EU data must stay in EU
    if (clientLocation.country && this.isEUCountry(clientLocation.country)) {
      rules.push(new GDPRComplianceRule());
    }

    // China data residency requirements
    if (clientLocation.country === "CN") {
      rules.push(new ChinaDataResidencyRule());
    }

    // Russian data localization
    if (clientLocation.country === "RU") {
      rules.push(new RussianDataLocalizationRule());
    }

    return rules;
  }

  async getGlobalSystemStatus(): Promise<GlobalSystemStatus> {
    const regionStatuses = await Promise.all(
      Array.from(this.regions.values()).map(async (region) => ({
        regionId: region.regionId,
        healthy: region.isHealthy(),
        load: region.getCurrentLoad(),
        capacity: region.getMaxCapacity(),
        latency: await region.getAverageLatency(),
        replicationLag: await this.dataReplication.getReplicationLag(
          region.regionId
        ),
      }))
    );

    const globalMetrics = await this.crossRegionMonitoring.getGlobalMetrics();

    return new GlobalSystemStatus(
      regionStatuses,
      globalMetrics,
      this.calculateGlobalHealthScore(regionStatuses)
    );
  }

  private calculateGlobalHealthScore(regionStatuses: any[]): number {
    const healthyRegions = regionStatuses.filter((r) => r.healthy).length;
    const totalRegions = regionStatuses.length;

    if (totalRegions === 0) return 0;

    const baseScore = healthyRegions / totalRegions;

    // Factor in load distribution (utilization = load / capacity, as in selectByLoad)
    const avgUtilization =
      regionStatuses.reduce((sum, r) => sum + r.load / r.capacity, 0) /
      totalRegions;
    const loadFactor = Math.max(0, 1 - avgUtilization); // Penalty for high utilization

    // Factor in replication health
    const maxReplicationLag = Math.max(
      ...regionStatuses.map((r) => r.replicationLag || 0)
    );
    // Linear penalty that reaches zero at 1 minute of lag
    const replicationFactor = Math.max(0, 1 - maxReplicationLag / 60000);

    return baseScore * 0.6 + loadFactor * 0.2 + replicationFactor * 0.2;
  }

  private initializeRegions(): void {
    for (const config of this.regionConfigs) {
      const regionManager = new RegionManager(config);
      this.regions.set(config.regionId, regionManager);
    }
  }

  private setupGlobalRouting(): void {
    this.globalRouter = new GlobalTrafficRouter(
      this.regions,
      this.routingStrategy
    );
  }

  private startCrossRegionMonitoring(): void {
    this.crossRegionMonitoring = new CrossRegionMonitoring(this.regions);
    this.crossRegionMonitoring.startMonitoring();
  }
}
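
`replicateToRegion` attaches a checksum to each payload, but `calculateChecksum` is not shown. Here is a minimal sketch, assuming the payload is plain JSON-serializable data: serialize with sorted keys so that property order does not change the result, then hash the string. FNV-1a is used only because it fits in a few lines without dependencies. It catches accidental corruption, not tampering, and a production system would more likely use a cryptographic hash such as SHA-256. All function names below are hypothetical.

```typescript
// Serialize with object keys sorted so equal data always yields equal text
function canonicalJson(value: unknown): string {
  if (value === null || typeof value !== "object") {
    const text = JSON.stringify(value);
    return text === undefined ? "null" : text; // undefined and functions become null
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const record = value as Record<string, unknown>;
  const fields = Object.keys(record)
    .sort()
    .map((key) => JSON.stringify(key) + ":" + canonicalJson(record[key]));
  return "{" + fields.join(",") + "}";
}

// 32-bit FNV-1a over UTF-16 code units. For ASCII input this matches the
// standard byte-oriented FNV-1a; for other text the values will differ.
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

function replicationChecksum(data: unknown): string {
  return fnv1a32(canonicalJson(data));
}

const sent = replicationChecksum({ id: 42, status: "paid" });
const received = replicationChecksum({ status: "paid", id: 42 });
console.log(sent === received); // true: key order does not affect the checksum
```

The target region would recompute the checksum over the data it received and reject the write on a mismatch, which is what makes the `checksum` field in `replicationData` useful rather than decorative.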

// Global traffic router with intelligent routing
export class GlobalTrafficRouter {
  private routingCache = new Map<string, RoutingDecision>();
  private performanceMetrics = new Map<string, PerformanceMetrics>();

  constructor(
    private regions: Map<string, RegionManager>,
    private strategy: RoutingStrategy
  ) {}

  async routeRequest(
    request: GlobalRequest,
    clientLocation: GeoLocation
  ): Promise<RoutingDecision> {
    const cacheKey = this.generateCacheKey(request, clientLocation);

    // Check if we have a recent routing decision
    const cachedDecision = this.routingCache.get(cacheKey);
    if (cachedDecision && !this.isStale(cachedDecision)) {
      return cachedDecision;
    }

    // Calculate new routing decision
    const decision = await this.calculateOptimalRouting(
      request,
      clientLocation
    );

    // Cache decision for future requests
    this.routingCache.set(cacheKey, decision);

    return decision;
  }

  private async calculateOptimalRouting(
    request: GlobalRequest,
    clientLocation: GeoLocation
  ): Promise<RoutingDecision> {
    const candidateRegions = await this.getCandidateRegions(
      request,
      clientLocation
    );
    const routingScores = await this.calculateRoutingScores(
      candidateRegions,
      clientLocation
    );

    // Sort by score (highest first)
    routingScores.sort((a, b) => b.score - a.score);

    const primaryRegion = routingScores[0].region;
    const fallbackRegions = routingScores.slice(1, 3).map((r) => r.region); // Top 2 fallbacks

    return new RoutingDecision(
      primaryRegion,
      fallbackRegions,
      routingScores[0].score,
      this.strategy
    );
  }

  private async calculateRoutingScores(
    regions: RegionManager[],
    clientLocation: GeoLocation
  ): Promise<RoutingScore[]> {
    return Promise.all(
      regions.map(async (region) => {
        // Factor 1: Latency (40% weight)
        const estimatedLatency = await this.estimateLatency(
          clientLocation,
          region.location
        );
        const latencyScore = Math.max(0, 100 - estimatedLatency / 10); // 10ms = 1 point penalty

        // Factor 2: Current load (30% weight)
        const loadPercentage =
          (region.getCurrentLoad() / region.getMaxCapacity()) * 100;
        const loadScore = Math.max(0, 100 - loadPercentage);

        // Factor 3: Historical reliability (20% weight), assumed to be on a 0-100 scale
        const reliability = await this.getRegionReliability(region.regionId);

        // Factor 4: Cost (10% weight)
        const costScore = this.calculateCostScore(region, clientLocation);

        // Combine as one weighted sum so each factor gets exactly its stated
        // weight. Blending step by step would dilute the earlier factors
        const score =
          latencyScore * 0.4 +
          loadScore * 0.3 +
          reliability * 0.2 +
          costScore * 0.1;

        return new RoutingScore(region, score, {
          latency: estimatedLatency,
          load: loadPercentage,
          reliability,
          cost: costScore,
        });
      })
    );
  }

  updatePerformanceMetrics(
    regionId: string,
    latency: number,
    success: boolean
  ): void {
    let metrics = this.performanceMetrics.get(regionId);
    if (!metrics) {
      metrics = new PerformanceMetrics();
      this.performanceMetrics.set(regionId, metrics);
    }

    metrics.recordRequest(latency, success);
  }

  private generateCacheKey(
    request: GlobalRequest,
    clientLocation: GeoLocation
  ): string {
    return `${request.type}_${clientLocation.country}_${request.requiresDataResidency}`;
  }

  private isStale(decision: RoutingDecision): boolean {
    const maxAgeMs = 5 * 60 * 1000; // Cached decisions expire after 5 minutes
    const ageMs = Date.now() - decision.timestamp.getTime();
    return ageMs > maxAgeMs;
  }
}

// Supporting types for multi-region architecture
export enum RoutingStrategy {
  LATENCY_BASED = "latency",
  LOAD_BALANCED = "load",
  COST_OPTIMIZED = "cost",
  PERFORMANCE_OPTIMIZED = "performance",
}

export class GlobalRequest {
  constructor(
    public type: string,
    public operation: string,
    public data: any,
    public requiresDataResidency: boolean = false
  ) {}

  isWrite(): boolean {
    return ["CREATE", "UPDATE", "DELETE"].includes(
      this.operation.toUpperCase()
    );
  }
}

export class GlobalResponse {
  constructor(
    public data: any,
    public handledByRegion: string,
    public totalLatency: number,
    public requestId: string,
    public status: string = "SUCCESS"
  ) {}
}

export class RoutingDecision {
  constructor(
    public primaryRegion: RegionManager,
    public fallbackRegions: RegionManager[],
    public score: number,
    public strategy: RoutingStrategy,
    public timestamp: Date = new Date()
  ) {}
}

export class RoutingScore {
  constructor(
    public region: RegionManager,
    public score: number,
    public factors: {
      latency: number;
      load: number;
      reliability: number;
      cost: number;
    }
  ) {}
}

export class GlobalSystemStatus {
  constructor(
    public regionStatuses: any[],
    public globalMetrics: any,
    public healthScore: number
  ) {}
}

// Data residency compliance rules
export abstract class DataResidencyRule {
  abstract allows(location: GeoLocation): boolean;
  abstract getName(): string;
}

export class GDPRComplianceRule extends DataResidencyRule {
  allows(location: GeoLocation): boolean {
    return this.isEURegion(location);
  }

  getName(): string {
    return "GDPR_EU_DATA_RESIDENCY";
  }

  // Illustrative subset of EU member states; a production rule needs the complete list.
  private isEURegion(location: GeoLocation): boolean {
    const euCountries = [
      "DE",
      "FR",
      "IT",
      "ES",
      "NL",
      "BE",
      "IE",
      "AT",
      "FI",
      "SE",
      "DK",
    ]; // etc.
    return euCountries.includes(location.country);
  }
}

export class ChinaDataResidencyRule extends DataResidencyRule {
  allows(location: GeoLocation): boolean {
    return location.country === "CN";
  }

  getName(): string {
    return "CHINA_DATA_LOCALIZATION";
  }
}

export class PerformanceMetrics {
  private requestCount = 0;
  private successCount = 0;
  private totalLatency = 0;
  private latencyHistory: number[] = [];

  recordRequest(latency: number, success: boolean): void {
    this.requestCount++;
    this.totalLatency += latency;

    if (success) {
      this.successCount++;
    }

    this.latencyHistory.push(latency);
    if (this.latencyHistory.length > 100) {
      this.latencyHistory.shift(); // Keep only last 100 measurements
    }
  }

  getAverageLatency(): number {
    return this.requestCount > 0 ? this.totalLatency / this.requestCount : 0;
  }

  getSuccessRate(): number {
    return this.requestCount > 0 ? this.successCount / this.requestCount : 1;
  }

  getP95Latency(): number {
    if (this.latencyHistory.length === 0) return 0;

    const sorted = [...this.latencyHistory].sort((a, b) => a - b);
    const index = Math.floor(sorted.length * 0.95);
    return sorted[index];
  }
}
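
To make the scoring weights concrete, here is a minimal standalone sketch of the 40/30/20/10 weighting named in the code comments, written as a single weighted sum. The helper `weightedRoutingScore`, the `ScoreFactors` shape, and the sample numbers are illustrative assumptions, not part of the router class above. It also assumes that reliability and cost are already expressed on a 0-100 scale.

```typescript
// Hypothetical standalone helper: names and sample values are illustrative only.
interface ScoreFactors {
  latencyMs: number;      // estimated latency in milliseconds
  loadPercentage: number; // current load, 0-100
  reliability: number;    // assumed to be on a 0-100 scale
  costScore: number;      // assumed to be on a 0-100 scale
}

// Weights mirror the comments in the scoring code and sum to 1.0.
const WEIGHTS = { latency: 0.4, load: 0.3, reliability: 0.2, cost: 0.1 };

// Keep every factor inside the 0-100 range before weighting it.
function clampScore(value: number): number {
  return Math.min(100, Math.max(0, value));
}

function weightedRoutingScore(f: ScoreFactors): number {
  const latencyScore = clampScore(100 - f.latencyMs / 10); // 10ms = 1 point penalty
  const loadScore = clampScore(100 - f.loadPercentage);
  return (
    latencyScore * WEIGHTS.latency +
    loadScore * WEIGHTS.load +
    clampScore(f.reliability) * WEIGHTS.reliability +
    clampScore(f.costScore) * WEIGHTS.cost
  );
}

// A nearby but busy region versus a distant but idle one.
const nearbyBusy = weightedRoutingScore({
  latencyMs: 50,
  loadPercentage: 80,
  reliability: 99,
  costScore: 60,
});
const distantIdle = weightedRoutingScore({
  latencyMs: 250,
  loadPercentage: 20,
  reliability: 99,
  costScore: 60,
});

// nearbyBusy is about 69.8 and distantIdle is about 79.8,
// so the idle region wins despite the extra 200ms of latency.
console.log(nearbyBusy.toFixed(1), distantIdle.toFixed(1));
```

Because the weights sum to 1.0, the result stays on the same 0-100 scale as the inputs, which makes scores comparable across regions and easy to reason about when tuning the weights.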

The Bottom Line: Build Once, Scale Forever

The patterns you’ve just absorbed aren’t academic theory. They belong to the same families of techniques (redundancy, automated failover, geographic distribution) that large platforms such as Netflix, Amazon, and Google rely on to stay available through regional outages and traffic spikes.

Here’s the uncomfortable truth: what separates companies that survive explosive growth from those that crumble under their own success is often not talent or funding. It’s whether they designed for scale and high availability early, before the traffic arrived.

Key Takeaways:

  • High availability isn’t about preventing failures—it’s about making failures invisible to users through intelligent fallbacks and circuit breakers
  • Disaster recovery isn’t a plan—it’s an automated system that you test constantly, not something you figure out when the data center catches fire
  • Multi-region isn’t just for global companies—even local businesses need geographic distribution because your “local” data center will go down at the worst possible moment
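
The first takeaway, fallbacks combined with circuit breakers, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used anywhere in this series: `CircuitBreaker`, `callWithFailover`, the failure threshold, and the cool-down period are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical sketch: a per-region circuit breaker plus ordered failover.
type RegionCall<T> = (regionId: string) => Promise<T>;

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 3,  // consecutive failures before opening
    private readonly resetAfterMs = 30000,  // cool-down before a trial request
    private readonly now: () => number = () => Date.now() // injectable clock for tests
  ) {}

  // Open means "skip this region". After the cool-down the breaker lets
  // one trial request through (the half-open state).
  isOpen(): boolean {
    if (this.failures < this.failureThreshold) return false;
    return this.now() - this.openedAt < this.resetAfterMs;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now(); // (re)start the cool-down
    }
  }
}

// Try regions in priority order (primary first, then fallbacks),
// skipping any region whose breaker is currently open.
async function callWithFailover<T>(
  regionIds: string[],
  breakers: Map<string, CircuitBreaker>,
  call: RegionCall<T>
): Promise<{ regionId: string; result: T }> {
  for (const regionId of regionIds) {
    let breaker = breakers.get(regionId);
    if (!breaker) {
      breaker = new CircuitBreaker();
      breakers.set(regionId, breaker);
    }
    if (breaker.isOpen()) continue;

    try {
      const result = await call(regionId);
      breaker.recordSuccess();
      return { regionId, result };
    } catch {
      breaker.recordFailure(); // fall through to the next region
    }
  }
  throw new Error("All regions failed or have open circuits");
}
```

From the caller’s point of view, a failing primary region shows up only as a slightly slower response served by a fallback. Once the breaker opens, later requests skip the unhealthy region entirely instead of waiting on it to fail again.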

The patterns in this series (CQRS, Event Sourcing, Saga orchestration, intelligent caching, global load balancing) aren’t separate tools to pick and choose from. They’re pieces of a coherent system that becomes more resilient as it grows.

What comes next? We’ve covered scalability and high availability. The next phase of your architecture education covers Performance & Optimization—because building a system that can scale to millions of users is meaningless if each request takes 10 seconds to complete.