Backend Monitoring & Observability
The $12.5 Million “Everything is Fine” Monitoring Disaster That Broke the Internet
Picture this observability nightmare: A major streaming platform with 300 million subscribers launches their “next-generation” monitoring system just before the season finale of their most popular show. Their DevOps team, fresh from a Kubernetes conference, implements every trendy observability tool they could find: Prometheus, Grafana, Jaeger, ELK stack, and custom dashboards showing green checkmarks everywhere.
Sunday night, 100 million users try to watch the finale simultaneously. The monitoring dashboard shows everything is “operational” while users experience complete service outages.
The symptoms were catastrophically expensive:
- Revenue loss hit $12.5 million in 4 hours: Premium subscribers cancelled en masse after missing the season finale
- Customer service received 2.3 million complaints: Users couldn’t stream anything while metrics showed “99.9% uptime”
- Social media exploded with 847,000 angry posts: #StreamingDown trended worldwide while their status page showed “All Systems Operational”
- Database queries averaged 47 seconds: Critical user authentication was failing while CPU metrics looked normal
- CDN served 404s to 67% of requests: Video content wasn’t loading while bandwidth graphs showed healthy traffic
- Memory leaks consumed 94% of available RAM: Application servers were thrashing while memory alerts never fired
Here’s what their expensive monitoring post-mortem revealed:
- Vanity metrics instead of user impact: They monitored CPU and memory but ignored actual user experience and business transactions
- Alert fatigue from irrelevant notifications: 15,000+ alerts per day about non-critical issues, causing engineers to ignore all notifications
- No distributed tracing across services: They couldn’t follow a single user request across their 247 microservices
- Monitoring the monitoring tools: They spent more time fixing Grafana dashboards than actual application issues
- Missing business logic observability: They tracked infrastructure but had zero visibility into authentication failures, payment processing, or content delivery
- Reactive instead of predictive monitoring: Every alert fired after users were already impacted, not before problems occurred
The final damage:
- $12.5 million in lost revenue from cancellations and refunds during a single evening
- 67% drop in user trust scores as subscribers lost faith in service reliability
- 8 months of engineering time rebuilding their entire observability infrastructure from scratch
- Complete executive turnover in the engineering organization after the board intervention
- Regulatory investigation as the outage affected emergency broadcast capabilities
The brutal truth? They had comprehensive monitoring that monitored everything except what actually mattered to users and the business.
The Uncomfortable Truth About Monitoring and Observability
Here’s what separates monitoring systems that prevent disasters from those that hide them: True observability isn’t about collecting more data—it’s about understanding user impact, business outcomes, and system behavior patterns. The more metrics you collect without context, the less visibility you actually have into what matters.
Most developers approach monitoring like this:
- Install popular monitoring tools and collect every metric available
- Create beautiful dashboards that show technical metrics instead of user experience
- Set up alerts based on arbitrary thresholds without understanding user impact
- React to problems after users are already affected
- Focus on infrastructure metrics while ignoring business logic and user journeys
But developers who build truly observable systems work differently:
- Monitor user experience first, infrastructure second by tracking real user interactions and business transactions
- Implement distributed tracing to understand how requests flow through complex microservices architectures
- Use contextual alerting that correlates multiple signals to reduce noise and focus on actual problems
- Build predictive monitoring that identifies problems before users are impacted
- Instrument business logic to understand not just if systems are running, but if they’re delivering value
The difference isn’t just knowing when problems occur—it’s understanding why they happen and preventing them from affecting users in the first place.
Ready to build monitoring systems that actually help you deliver reliable software instead of creating a false sense of security? Let’s dive into observability patterns that work in production.
Foundation of Effective Observability: The Three Pillars Done Right
The Problem: Monitoring Everything While Seeing Nothing
// The observability nightmare that creates more problems than it solves
const express = require("express");
const prometheus = require("prom-client");
const app = express();
// Collecting every metric imaginable - RED FLAG #1
const httpRequestsTotal = new prometheus.Counter({
name: "http_requests_total",
help: "Total number of HTTP requests",
labelNames: [
"method",
"route",
"status_code",
"user_agent",
"ip",
"referrer",
],
});
const httpRequestDuration = new prometheus.Histogram({
name: "http_request_duration_seconds",
help: "HTTP request duration in seconds",
buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 1, 2, 5, 10, 30], // Too many buckets - RED FLAG #2
labelNames: ["method", "route", "status_code"],
});
const memoryUsage = new prometheus.Gauge({
  name: "process_memory_usage_bytes",
  help: "Process memory usage",
  labelNames: ["type"],
  collect() {
    const usage = process.memoryUsage();
    this.set({ type: "heapUsed" }, usage.heapUsed);
    this.set({ type: "rss" }, usage.rss);
    this.set({ type: "heapTotal" }, usage.heapTotal);
    this.set({ type: "external" }, usage.external);
  },
});
// Logging everything without structure - RED FLAG #3
app.use((req, res, next) => {
console.log(`${new Date().toISOString()} ${req.method} ${req.url}`);
console.log("Headers:", req.headers);
console.log("Body:", req.body);
console.log("Query:", req.query);
console.log("User Agent:", req.get("User-Agent"));
console.log("IP:", req.ip);
// Increment metrics for every request - RED FLAG #4
httpRequestsTotal.inc({
method: req.method,
route: req.route?.path || "unknown",
status_code: res.statusCode,
user_agent: req.get("User-Agent"),
ip: req.ip,
referrer: req.get("Referrer") || "none",
});
next();
});
// Business logic with no observability - RED FLAG #5
app.post("/api/orders", async (req, res) => {
try {
// No tracing of business operations
const user = await User.findById(req.body.userId);
const product = await Product.findById(req.body.productId);
const inventory = await checkInventory(product.id);
const payment = await processPayment(req.body.paymentInfo);
const order = await createOrder(user, product, payment);
res.json({ success: true, orderId: order.id });
} catch (error) {
// Generic error logging - RED FLAG #6
console.error("Order creation failed:", error.message);
res.status(500).json({ error: "Internal server error" });
}
});
// Alert on everything - RED FLAG #7
const alertRules = [
"cpu_usage > 50%", // Too sensitive
"memory_usage > 60%", // No context
"disk_usage > 70%", // Arbitrary threshold
"http_requests_total > 100", // No time window context
"error_count > 0", // Alert fatigue guaranteed
];
// Problems this creates:
// - High-cardinality metrics explode storage costs and query times
// - Logs are unstructured and impossible to search effectively
// - No connection between metrics, traces, and logs
// - Alerts fire constantly for non-critical issues
// - No understanding of user impact or business outcomes
// - Missing context about why problems occur
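To make the high-cardinality problem concrete, here is a rough back-of-the-envelope series count for http_requests_total as labelled above (every per-label count below is an illustrative assumption, not a measurement):
// Rough cardinality estimate for the label set used above
const methods = 5;
const routes = 200;
const statusCodes = 20;
const userAgents = 5_000; // effectively unbounded in real traffic
const clientIps = 100_000; // also unbounded
// Prometheus stores one time series per unique label combination
const potentialSeries = methods * routes * statusCodes * userAgents * clientIps;
console.log(potentialSeries.toLocaleString()); // 10,000,000,000,000 potential series
Even with generous rounding, two unbounded labels turn a single counter into trillions of potential series, which is why the solution below restricts labels to a handful of low-cardinality dimensions.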
The Solution: Strategic Observability with User-Centric Monitoring
// Production-ready observability with proper instrumentation
import express from "express";
import { Counter, Histogram, Gauge, register } from "prom-client";
import { trace, context, SpanStatusCode } from "@opentelemetry/api";
import { createLogger, format, transports } from "winston";
import { correlationId } from "./middleware/correlation";
// Strategic metric collection focused on user experience
export class ObservabilityEngine {
private static instance: ObservabilityEngine;
// Business-focused metrics (RED method: Rate, Errors, Duration)
private readonly businessTransactionRate = new Counter({
name: "business_transactions_total",
help: "Total business transactions by type and outcome",
labelNames: ["transaction_type", "outcome", "customer_segment"],
});
private readonly businessTransactionDuration = new Histogram({
name: "business_transaction_duration_seconds",
help: "Business transaction duration by type",
buckets: [0.1, 0.3, 0.5, 1, 2, 5, 10], // Focused buckets based on SLA
labelNames: ["transaction_type", "customer_segment"],
});
private readonly userExperienceMetrics = new Histogram({
name: "user_experience_response_time",
help: "User-perceived response time",
buckets: [0.1, 0.2, 0.5, 1, 2, 3], // Based on user experience research
labelNames: ["endpoint", "user_tier"],
});
// Infrastructure metrics (USE method: Utilization, Saturation, Errors)
private readonly systemUtilization = new Gauge({
name: "system_utilization_ratio",
help: "System resource utilization",
labelNames: ["resource_type", "instance"],
});
private readonly systemSaturation = new Gauge({
name: "system_saturation_ratio",
help: "System resource saturation",
labelNames: ["resource_type", "instance"],
});
// Structured logging with correlation
private readonly logger = createLogger({
format: format.combine(
format.timestamp(),
format.errors({ stack: true }),
format.json(),
format((info) => {
// Add correlation ID to all logs
info.correlationId = correlationId.getStore()?.correlationId;
info.traceId = trace.getActiveSpan()?.spanContext().traceId;
info.spanId = trace.getActiveSpan()?.spanContext().spanId;
return info;
})()
),
transports: [
new transports.Console(),
new transports.File({
filename: "logs/error.log",
level: "error",
maxsize: 50 * 1024 * 1024, // 50MB
maxFiles: 10,
}),
new transports.File({
filename: "logs/business.log",
level: "info",
maxsize: 100 * 1024 * 1024, // 100MB
maxFiles: 20,
}),
],
});
static getInstance(): ObservabilityEngine {
if (!ObservabilityEngine.instance) {
ObservabilityEngine.instance = new ObservabilityEngine();
}
return ObservabilityEngine.instance;
}
// Business transaction instrumentation
instrumentBusinessTransaction<T>(
transactionType: string,
customerSegment: string,
operation: () => Promise<T>
): Promise<T> {
return this.withBusinessSpan(transactionType, async (span) => {
const timer = this.businessTransactionDuration.startTimer({
transaction_type: transactionType,
customer_segment: customerSegment,
});
try {
span.setAttributes({
"business.transaction.type": transactionType,
"business.customer.segment": customerSegment,
"business.transaction.id":
correlationId.getStore()?.correlationId || "unknown",
});
const result = await operation();
// Record successful business transaction
this.businessTransactionRate.inc({
transaction_type: transactionType,
outcome: "success",
customer_segment: customerSegment,
});
span.setStatus({ code: SpanStatusCode.OK });
this.logger.info("Business transaction completed", {
transactionType,
customerSegment,
outcome: "success",
duration: timer(),
});
return result;
} catch (error: any) {
// Record failed business transaction with context
this.businessTransactionRate.inc({
transaction_type: transactionType,
outcome: "failure",
customer_segment: customerSegment,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
this.logger.error("Business transaction failed", {
transactionType,
customerSegment,
error: error.message,
errorCode: error.code,
stack: error.stack,
duration: timer(),
});
throw error;
}
});
}
// User experience monitoring
trackUserExperience(
endpoint: string,
userTier: string,
responseTime: number,
success: boolean
): void {
this.userExperienceMetrics.observe(
{ endpoint, user_tier: userTier },
responseTime / 1000 // Convert to seconds
);
this.logger.info("User experience tracked", {
endpoint,
userTier,
responseTime,
success,
experienceGrade: this.calculateExperienceGrade(responseTime),
});
}
// System health monitoring
updateSystemMetrics(): void {
const usage = process.memoryUsage();
const cpuUsage = process.cpuUsage();
// Memory utilization
this.systemUtilization.set(
{ resource_type: "memory", instance: process.env.HOSTNAME || "unknown" },
usage.heapUsed / usage.heapTotal
);
// Memory saturation (approaching limits)
this.systemSaturation.set(
{ resource_type: "memory", instance: process.env.HOSTNAME || "unknown" },
usage.heapUsed / (usage.heapTotal * 0.8) // Alert at 80% of heap
);
this.logger.debug("System metrics updated", {
memory: {
heapUsed: usage.heapUsed,
heapTotal: usage.heapTotal,
utilization: usage.heapUsed / usage.heapTotal,
},
});
}
// Distributed tracing helpers
private async withBusinessSpan<T>(
name: string,
operation: (span: any) => Promise<T>
): Promise<T> {
const tracer = trace.getTracer("business-operations");
return tracer.startActiveSpan(name, async (span) => {
try {
return await operation(span);
} finally {
span.end();
}
});
}
// Alert context enrichment
enrichAlertContext(alertData: AlertData): EnrichedAlert {
    const currentCorrelationId = correlationId.getStore()?.correlationId;
const activeSpan = trace.getActiveSpan();
return {
...alertData,
context: {
        correlationId: currentCorrelationId,
traceId: activeSpan?.spanContext().traceId,
spanId: activeSpan?.spanContext().spanId,
businessContext: this.getBusinessContext(),
userImpact: this.calculateUserImpact(alertData),
suggestedActions: this.getSuggestedActions(alertData),
},
};
}
private calculateExperienceGrade(responseTime: number): string {
if (responseTime < 200) return "excellent";
if (responseTime < 500) return "good";
if (responseTime < 1000) return "fair";
if (responseTime < 2000) return "poor";
return "unacceptable";
}
private getBusinessContext(): any {
// Would integrate with business context provider
return {
currentPromotions: ["black-friday-sale"],
peakHours: this.isDuringPeakHours(),
maintenanceWindows: [],
};
}
private calculateUserImpact(alertData: AlertData): UserImpactAssessment {
// Sophisticated user impact calculation based on alert type and context
const impactFactors = {
"payment-processing-error": {
severity: "critical",
affectedUsers: "all-paying",
},
"authentication-failure": {
severity: "high",
affectedUsers: "all-users",
},
"search-slowdown": { severity: "medium", affectedUsers: "search-users" },
"recommendation-error": {
severity: "low",
affectedUsers: "browsing-users",
},
};
return (
impactFactors[alertData.type] || {
severity: "unknown",
affectedUsers: "unknown",
}
);
}
private getSuggestedActions(alertData: AlertData): string[] {
// Runbook automation suggestions
const actionMap: Record<string, string[]> = {
"high-cpu": [
"Check for memory leaks in recent deployments",
"Scale horizontally if sustained high load",
"Review slow queries in the last hour",
],
"payment-errors": [
"Check payment provider status",
"Verify payment service database connections",
"Review recent payment service deployments",
"Activate backup payment processor if needed",
],
"database-slow": [
"Check for long-running queries",
"Verify index utilization",
"Check database connection pool status",
"Consider read replica failover",
],
};
return (
actionMap[alertData.type] || ["Check system logs and recent changes"]
);
}
private isDuringPeakHours(): boolean {
const hour = new Date().getHours();
return hour >= 18 && hour <= 23; // 6 PM to 11 PM
}
}
// Comprehensive application instrumentation
export class ApplicationInstrumentation {
private observability: ObservabilityEngine;
constructor() {
this.observability = ObservabilityEngine.getInstance();
this.setupSystemMetricsCollection();
}
// Express middleware for request instrumentation
requestInstrumentation() {
return async (
req: express.Request,
res: express.Response,
next: express.NextFunction
) => {
const startTime = Date.now();
      const corrId =
        (req.headers["x-correlation-id"] as string) || this.generateCorrelationId();
      // Set correlation context for the entire request
      correlationId.run({ correlationId: corrId }, async () => {
const tracer = trace.getTracer("http-requests");
await tracer.startActiveSpan(
`${req.method} ${req.path}`,
async (span) => {
span.setAttributes({
"http.method": req.method,
"http.url": req.url,
"http.user_agent": req.get("User-Agent") || "",
"user.tier": req.user?.tier || "anonymous",
"request.correlation_id": correlationId,
});
try {
res.on("finish", () => {
const duration = Date.now() - startTime;
const userTier = req.user?.tier || "anonymous";
span.setAttributes({
"http.status_code": res.statusCode,
"http.response_size": res.get("content-length") || 0,
});
// Track user experience
this.observability.trackUserExperience(
req.path,
userTier,
duration,
res.statusCode < 400
);
if (res.statusCode >= 400) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: `HTTP ${res.statusCode}`,
});
}
span.end();
});
next();
} catch (error: any) {
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
next(error);
}
}
);
});
};
}
// Database operation instrumentation
instrumentDatabaseOperation<T>(
operation: string,
table: string,
query: () => Promise<T>
): Promise<T> {
return this.observability.instrumentBusinessTransaction(
`database.${operation}`,
"system",
async () => {
const tracer = trace.getTracer("database");
return tracer.startActiveSpan(
`db.${operation}.${table}`,
async (span) => {
span.setAttributes({
"db.operation": operation,
"db.table": table,
"db.type": "postgresql",
});
try {
const result = await query();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
);
}
// External service call instrumentation
instrumentExternalCall<T>(
serviceName: string,
operation: string,
call: () => Promise<T>
): Promise<T> {
return this.observability.instrumentBusinessTransaction(
`external.${serviceName}.${operation}`,
"system",
async () => {
const tracer = trace.getTracer("external-services");
return tracer.startActiveSpan(
`external.${serviceName}.${operation}`,
async (span) => {
span.setAttributes({
"service.name": serviceName,
"service.operation": operation,
"service.type": "external",
});
const startTime = Date.now();
try {
const result = await call();
const duration = Date.now() - startTime;
span.setAttributes({
"service.duration_ms": duration,
"service.success": true,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
span.setAttributes({
"service.duration_ms": duration,
"service.success": false,
"service.error": error.message,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
);
}
private setupSystemMetricsCollection(): void {
// Collect system metrics every 15 seconds
setInterval(() => {
this.observability.updateSystemMetrics();
}, 15000);
}
private generateCorrelationId(): string {
return `req_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
}
}
// Business-specific monitoring
export class BusinessMonitoring {
private observability: ObservabilityEngine;
constructor() {
this.observability = ObservabilityEngine.getInstance();
}
// E-commerce specific monitoring
async monitorOrderProcessing(orderData: OrderData): Promise<void> {
await this.observability.instrumentBusinessTransaction(
"order.create",
orderData.customerSegment,
async () => {
const tracer = trace.getTracer("business.orders");
await tracer.startActiveSpan("process.order", async (span) => {
span.setAttributes({
"order.id": orderData.id,
"order.value": orderData.totalValue,
"order.items_count": orderData.items.length,
"customer.segment": orderData.customerSegment,
"customer.tier": orderData.customerTier,
});
// Track business-specific metrics
this.trackBusinessMetrics("order_created", {
value: orderData.totalValue,
customerSegment: orderData.customerSegment,
itemsCount: orderData.items.length,
});
span.end();
});
}
);
}
// Payment processing monitoring
async monitorPaymentProcessing(paymentData: PaymentData): Promise<void> {
await this.observability.instrumentBusinessTransaction(
"payment.process",
paymentData.customerSegment,
async () => {
const tracer = trace.getTracer("business.payments");
await tracer.startActiveSpan("process.payment", async (span) => {
span.setAttributes({
"payment.method": paymentData.method,
"payment.amount": paymentData.amount,
"payment.currency": paymentData.currency,
"payment.processor": paymentData.processor,
});
this.trackBusinessMetrics("payment_processed", {
amount: paymentData.amount,
method: paymentData.method,
processor: paymentData.processor,
});
span.end();
});
}
);
}
private trackBusinessMetrics(eventType: string, data: any): void {
const logger = createLogger({ transports: [new transports.Console()] });
logger.info("Business event tracked", {
eventType,
...data,
timestamp: new Date().toISOString(),
});
}
}
// Supporting interfaces
interface AlertData {
type: string;
severity: "low" | "medium" | "high" | "critical";
message: string;
timestamp: Date;
source: string;
}
interface EnrichedAlert extends AlertData {
context: {
correlationId?: string;
traceId?: string;
spanId?: string;
businessContext: any;
userImpact: UserImpactAssessment;
suggestedActions: string[];
};
}
interface UserImpactAssessment {
severity: "low" | "medium" | "high" | "critical" | "unknown";
affectedUsers: string;
}
interface OrderData {
id: string;
totalValue: number;
items: any[];
customerSegment: string;
customerTier: string;
}
interface PaymentData {
method: string;
amount: number;
currency: string;
processor: string;
customerSegment: string;
}
Distributed Tracing: Understanding Complex System Interactions
The Problem: Black Box Microservices with No Visibility
// The distributed tracing nightmare - no correlation between services
// Service A: User Authentication Service
app.post("/auth/login", async (req, res) => {
try {
// No trace correlation - RED FLAG #1
console.log("Login attempt started");
const user = await userService.validateCredentials(
req.body.email,
req.body.password
);
// Calling external service with no tracing - RED FLAG #2
const permissions = await fetch(
"http://permission-service/api/permissions/" + user.id
);
// No context propagation - RED FLAG #3
const token = await tokenService.generateToken(user.id);
console.log("Login successful");
res.json({ token, user: user.id });
} catch (error) {
// No trace correlation in error - RED FLAG #4
console.error("Login failed:", error.message);
res.status(401).json({ error: "Authentication failed" });
}
});
// Service B: Permission Service
app.get("/api/permissions/:userId", async (req, res) => {
try {
// No incoming trace context - RED FLAG #5
console.log("Fetching permissions for user:", req.params.userId);
// Database call with no tracing - RED FLAG #6
const permissions = await db.query(
"SELECT * FROM permissions WHERE user_id = ?",
[req.params.userId]
);
// Calling another service - RED FLAG #7
const roleData = await fetch(
"http://role-service/api/roles/" + permissions.roleId
);
res.json(permissions);
} catch (error) {
console.error("Permission fetch failed:", error.message);
res.status(500).json({ error: "Internal server error" });
}
});
// Service C: Token Service
app.post("/api/tokens", async (req, res) => {
try {
// No correlation with original request - RED FLAG #8
console.log("Generating token for user:", req.body.userId);
// Redis call with no tracing - RED FLAG #9
const existingToken = await redis.get("token:" + req.body.userId);
if (existingToken) {
return res.json({ token: existingToken });
}
// JWT generation with no span context - RED FLAG #10
const token = jwt.sign({ userId: req.body.userId }, process.env.JWT_SECRET);
await redis.setex("token:" + req.body.userId, 3600, token);
res.json({ token });
} catch (error) {
console.error("Token generation failed:", error.message);
res.status(500).json({ error: "Token generation failed" });
}
});
// Problems this creates:
// - No way to trace a user request across multiple services
// - Cannot correlate logs from different services for the same user action
// - No understanding of which service is causing slowdowns
// - Impossible to debug complex interaction failures
// - No visibility into service dependencies and bottlenecks
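Before looking at the fix, it helps to see what context propagation actually puts on the wire. OpenTelemetry uses the W3C Trace Context format by default, so every outbound call carries a traceparent header; the sketch below shows the header shape and a manual injection (the helper name is illustrative, and the auto-instrumentation in the solution normally does this for you):
// W3C traceparent header carried on every inter-service hop:
//   version - trace-id (16 bytes, hex) - parent span-id (8 bytes, hex) - flags
//   traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
import { context, propagation } from "@opentelemetry/api";

async function callPermissionService(userId: string): Promise<unknown> {
  const headers: Record<string, string> = {};
  // Copies traceparent/tracestate from the active context into the headers object
  propagation.inject(context.active(), headers);
  const response = await fetch(
    `http://permission-service/api/permissions/${userId}`,
    { headers }
  );
  return response.json();
}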
The Solution: Comprehensive Distributed Tracing with Context Propagation
// Production-ready distributed tracing across microservices
import {
trace,
context,
propagation,
SpanStatusCode,
SpanKind,
} from "@opentelemetry/api";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { JaegerExporter } from "@opentelemetry/exporter-jaeger";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { ExpressInstrumentation } from "@opentelemetry/instrumentation-express";
import { RedisInstrumentation } from "@opentelemetry/instrumentation-redis";
import axios from "axios";
// Comprehensive tracing setup
export class DistributedTracingManager {
private static instance: DistributedTracingManager;
private tracer: any;
static getInstance(): DistributedTracingManager {
if (!DistributedTracingManager.instance) {
DistributedTracingManager.instance = new DistributedTracingManager();
}
return DistributedTracingManager.instance;
}
constructor() {
this.initializeTracing();
}
private initializeTracing(): void {
// Configure tracing with proper service identification
    const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]:
process.env.SERVICE_NAME || "unknown-service",
[SemanticResourceAttributes.SERVICE_VERSION]:
process.env.SERVICE_VERSION || "1.0.0",
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
process.env.NODE_ENV || "development",
}),
instrumentations: [
new HttpInstrumentation({
// Enhance HTTP spans with business context
requestHook: (span, request) => {
span.setAttributes({
"http.request.body_size": request.headers["content-length"] || 0,
"http.request.correlation_id":
request.headers["x-correlation-id"] || "none",
});
},
responseHook: (span, response) => {
span.setAttributes({
"http.response.body_size":
response.headers["content-length"] || 0,
});
},
}),
new ExpressInstrumentation({
// Add business context to Express spans
          requestHook: (span, info) => {
            const req = info.request as any; // ExpressRequestInfo exposes the request as `request`
            span.setAttributes({
              "express.route": info.route,
              "user.id": req.user?.id || "anonymous",
              "user.tier": req.user?.tier || "free",
            });
          },
}),
new RedisInstrumentation({
// Add Redis operation context
responseHook: (span, cmdName, cmdArgs) => {
span.setAttributes({
"redis.key_pattern": this.extractKeyPattern(cmdArgs[0]),
});
},
}),
],
      // Configure Jaeger exporter for trace collection
      traceExporter: new JaegerExporter({
        endpoint:
          process.env.JAEGER_ENDPOINT || "http://localhost:14268/api/traces",
        tags: [
          {
            key: "service.environment",
            value: process.env.NODE_ENV || "development",
          },
          {
            key: "service.datacenter",
            value: process.env.DATACENTER || "unknown",
          },
        ],
      }),
    });
    sdk.start();
this.tracer = trace.getTracer("business-operations");
}
// Create a traced HTTP client with automatic context propagation
createTracedHttpClient(): TracedHttpClient {
return new TracedHttpClient();
}
// Business operation tracing with automatic context propagation
async traceBusinessOperation<T>(
operationName: string,
operationData: OperationContext,
operation: () => Promise<T>
): Promise<T> {
return this.tracer.startActiveSpan(
operationName,
{
kind: SpanKind.INTERNAL,
attributes: {
"business.operation.type": operationData.type,
"business.operation.id": operationData.id,
"business.user.id": operationData.userId,
"business.user.tier": operationData.userTier,
"business.correlation.id": operationData.correlationId,
},
},
async (span: any) => {
try {
const result = await operation();
span.setAttributes({
"business.operation.success": true,
"business.operation.result_size": this.estimateObjectSize(result),
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
span.setAttributes({
"business.operation.success": false,
"business.operation.error_type": error.constructor.name,
"business.operation.error_code": error.code || "unknown",
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
// Database operation tracing with performance metrics
async traceDatabaseOperation<T>(
operation: string,
table: string,
query: string,
params: any[],
executor: () => Promise<T>
): Promise<T> {
return this.tracer.startActiveSpan(
`db.${operation}`,
{
kind: SpanKind.CLIENT,
attributes: {
"db.system": "postgresql",
"db.operation": operation,
"db.table": table,
"db.statement": this.sanitizeQuery(query),
"db.parameters_count": params.length,
},
},
async (span: any) => {
const startTime = Date.now();
try {
const result = await executor();
const duration = Date.now() - startTime;
span.setAttributes({
"db.duration_ms": duration,
"db.rows_affected": Array.isArray(result) ? result.length : 1,
"db.success": true,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
span.setAttributes({
"db.duration_ms": duration,
"db.success": false,
"db.error_code": error.code,
"db.error_message": error.message,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
// External service call tracing with retry and circuit breaker context
async traceExternalCall<T>(
serviceName: string,
operation: string,
url: string,
method: string,
call: () => Promise<T>,
retryCount: number = 0
): Promise<T> {
return this.tracer.startActiveSpan(
`external.${serviceName}.${operation}`,
{
kind: SpanKind.CLIENT,
attributes: {
"http.method": method,
"http.url": url,
"service.name": serviceName,
"service.operation": operation,
"service.retry_count": retryCount,
},
},
async (span: any) => {
const startTime = Date.now();
try {
const result = await call();
const duration = Date.now() - startTime;
span.setAttributes({
"http.response.duration_ms": duration,
"http.response.success": true,
"service.response_size": this.estimateObjectSize(result),
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
span.setAttributes({
"http.response.duration_ms": duration,
"http.response.success": false,
"http.response.status_code": error.response?.status || 0,
"service.error_type": error.constructor.name,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
// Extract correlation context from incoming requests
extractCorrelationContext(
headers: Record<string, string>
): CorrelationContext {
// Extract OpenTelemetry context from headers
const parentContext = propagation.extract(context.active(), headers);
const activeSpan = trace.getSpan(parentContext);
return {
traceId: activeSpan?.spanContext().traceId || "unknown",
spanId: activeSpan?.spanContext().spanId || "unknown",
correlationId:
headers["x-correlation-id"] || this.generateCorrelationId(),
userId: headers["x-user-id"],
userTier: headers["x-user-tier"],
};
}
// Inject correlation context into outgoing requests
injectCorrelationContext(
headers: Record<string, string> = {}
): Record<string, string> {
const activeSpan = trace.getActiveSpan();
// Inject OpenTelemetry context
propagation.inject(context.active(), headers);
// Add custom correlation headers
if (activeSpan) {
headers["x-correlation-id"] = activeSpan.spanContext().traceId;
}
return headers;
}
private extractKeyPattern(key: string): string {
// Extract Redis key patterns for better observability
return key.replace(/\d+/g, "*").replace(/[a-f0-9-]{8,}/g, "*");
}
private sanitizeQuery(query: string): string {
// Remove sensitive data from query for logging
return query.replace(/'\w+'/g, "'***'").substring(0, 200);
}
private estimateObjectSize(obj: any): number {
try {
return JSON.stringify(obj).length;
} catch {
return 0;
}
}
private generateCorrelationId(): string {
return `trace_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
}
}
// Traced HTTP client with automatic context propagation
export class TracedHttpClient {
private tracing: DistributedTracingManager;
constructor() {
this.tracing = DistributedTracingManager.getInstance();
}
async get<T>(url: string, options: RequestOptions = {}): Promise<T> {
return this.tracing.traceExternalCall(
this.extractServiceName(url),
"GET",
url,
"GET",
async () => {
const headers = this.tracing.injectCorrelationContext(
options.headers || {}
);
const response = await axios.get(url, { ...options, headers });
return response.data;
},
options.retryCount
);
}
async post<T>(
url: string,
data: any,
options: RequestOptions = {}
): Promise<T> {
return this.tracing.traceExternalCall(
this.extractServiceName(url),
"POST",
url,
"POST",
async () => {
const headers = this.tracing.injectCorrelationContext(
options.headers || {}
);
const response = await axios.post(url, data, { ...options, headers });
return response.data;
},
options.retryCount
);
}
private extractServiceName(url: string): string {
try {
const urlObj = new URL(url);
return urlObj.hostname.split(".")[0]; // Extract service name from hostname
} catch {
return "unknown-service";
}
}
}
// Complete microservice instrumentation example
export class MicroserviceInstrumentation {
private tracing: DistributedTracingManager;
private httpClient: TracedHttpClient;
constructor() {
this.tracing = DistributedTracingManager.getInstance();
this.httpClient = this.tracing.createTracedHttpClient();
}
// Authentication service with complete tracing
async authenticateUser(
credentials: UserCredentials,
headers: Record<string, string>
): Promise<AuthResult> {
const correlationContext = this.tracing.extractCorrelationContext(headers);
return this.tracing.traceBusinessOperation(
"user.authenticate",
{
type: "authentication",
id: correlationContext.correlationId,
userId: credentials.email,
userTier: "unknown",
correlationId: correlationContext.correlationId,
},
async () => {
// Step 1: Validate credentials with database tracing
const user = await this.tracing.traceDatabaseOperation(
"SELECT",
"users",
"SELECT id, email, password_hash, tier FROM users WHERE email = ?",
[credentials.email],
async () => {
return await db.query(
"SELECT id, email, password_hash, tier FROM users WHERE email = ?",
[credentials.email]
);
}
);
if (
!user ||
!(await bcrypt.compare(credentials.password, user.password_hash))
) {
throw new AuthenticationError("Invalid credentials");
}
// Step 2: Fetch permissions from external service
const permissions = await this.httpClient.get<UserPermissions>(
`${process.env.PERMISSION_SERVICE_URL}/api/permissions/${user.id}`,
{
headers: this.tracing.injectCorrelationContext({}),
retryCount: 0,
}
);
// Step 3: Generate token with external service call
const tokenData = await this.httpClient.post<TokenResponse>(
`${process.env.TOKEN_SERVICE_URL}/api/tokens`,
{ userId: user.id, permissions: permissions.roles },
{
headers: this.tracing.injectCorrelationContext({}),
retryCount: 1,
}
);
return {
token: tokenData.token,
user: {
id: user.id,
email: user.email,
tier: user.tier,
},
permissions: permissions.roles,
};
}
);
}
// Permission service with tracing
async getUserPermissions(
userId: string,
headers: Record<string, string>
): Promise<UserPermissions> {
const correlationContext = this.tracing.extractCorrelationContext(headers);
return this.tracing.traceBusinessOperation(
"permissions.fetch",
{
type: "authorization",
id: correlationContext.correlationId,
userId,
userTier: correlationContext.userTier || "unknown",
correlationId: correlationContext.correlationId,
},
async () => {
// Fetch user permissions with database tracing
const userRoles = await this.tracing.traceDatabaseOperation(
"SELECT",
"user_roles",
"SELECT role_id FROM user_roles WHERE user_id = ?",
[userId],
async () => {
return await db.query(
"SELECT role_id FROM user_roles WHERE user_id = ?",
[userId]
);
}
);
// Fetch role details from cache or database
const roles = await Promise.all(
userRoles.map((userRole: any) =>
this.tracing.traceDatabaseOperation(
"SELECT",
"roles",
"SELECT name, permissions FROM roles WHERE id = ?",
[userRole.role_id],
async () => {
return await db.query(
"SELECT name, permissions FROM roles WHERE id = ?",
[userRole.role_id]
);
}
)
)
);
return {
userId,
roles: roles.map((role) => ({
name: role.name,
permissions: JSON.parse(role.permissions),
})),
};
}
);
}
// Token service with caching and tracing
async generateToken(
tokenRequest: TokenRequest,
headers: Record<string, string>
): Promise<TokenResponse> {
const correlationContext = this.tracing.extractCorrelationContext(headers);
return this.tracing.traceBusinessOperation(
"token.generate",
{
type: "token_generation",
id: correlationContext.correlationId,
userId: tokenRequest.userId,
userTier: correlationContext.userTier || "unknown",
correlationId: correlationContext.correlationId,
},
async () => {
// Check for existing token in Redis with tracing
const existingToken = await this.tracing.traceExternalCall(
"redis",
"GET",
"redis://cache/tokens",
"GET",
async () => {
return await redis.get(`token:${tokenRequest.userId}`);
}
);
if (existingToken) {
return { token: existingToken, expiresIn: 3600 };
}
// Generate new token
const tokenData = {
userId: tokenRequest.userId,
permissions: tokenRequest.permissions,
iat: Math.floor(Date.now() / 1000),
exp: Math.floor(Date.now() / 1000) + 3600, // 1 hour
};
const token = jwt.sign(tokenData, process.env.JWT_SECRET!);
// Cache token with tracing
await this.tracing.traceExternalCall(
"redis",
"SETEX",
"redis://cache/tokens",
"SETEX",
async () => {
await redis.setex(`token:${tokenRequest.userId}`, 3600, token);
}
);
return { token, expiresIn: 3600 };
}
);
}
}
// Supporting interfaces
interface OperationContext {
type: string;
id: string;
userId: string;
userTier: string;
correlationId: string;
}
interface CorrelationContext {
traceId: string;
spanId: string;
correlationId: string;
userId?: string;
userTier?: string;
}
interface RequestOptions {
headers?: Record<string, string>;
timeout?: number;
retryCount?: number;
}
interface UserCredentials {
email: string;
password: string;
}
interface AuthResult {
token: string;
user: {
id: string;
email: string;
tier: string;
};
permissions: string[];
}
interface UserPermissions {
userId: string;
roles: Array<{
name: string;
permissions: string[];
}>;
}
interface TokenRequest {
userId: string;
permissions: string[];
}
interface TokenResponse {
token: string;
expiresIn: number;
}
class AuthenticationError extends Error {
constructor(message: string) {
super(message);
this.name = "AuthenticationError";
}
}
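To close the loop on the login flow from the problem example, a downstream permission-service route could reuse the incoming headers so its spans join the same trace (the Express app and error handling here are illustrative):
// Hypothetical permission-service route built on the instrumentation above
const instrumentation = new MicroserviceInstrumentation();

app.get("/api/permissions/:userId", async (req, res) => {
  try {
    // getUserPermissions extracts the propagated trace context from the headers,
    // so this request appears as a child of the authentication service's trace
    const permissions = await instrumentation.getUserPermissions(
      req.params.userId,
      req.headers as Record<string, string>
    );
    res.json(permissions);
  } catch (error) {
    res.status(500).json({ error: "Failed to fetch permissions" });
  }
});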
Intelligent Alerting: From Noise to Actionable Insights
The Problem: Alert Fatigue and Meaningless Notifications
# The alerting nightmare that trains everyone to ignore critical issues
groups:
- name: everything_is_an_emergency
rules:
# Alert on everything - RED FLAG #1
- alert: CPUUsageHigh
expr: cpu_usage_percent > 50 # Way too sensitive - RED FLAG #2
for: 1m
annotations:
summary: "CPU usage is high" # Meaningless message - RED FLAG #3
- alert: MemoryUsageHigh
expr: memory_usage_percent > 60 # No context - RED FLAG #4
for: 30s # Too quick to fire - RED FLAG #5
annotations:
summary: "Memory usage is high"
- alert: DiskUsageHigh
expr: disk_usage_percent > 70
for: 1m
annotations:
summary: "Disk usage is high"
- alert: HTTPRequests
expr: rate(http_requests_total[5m]) > 100 # Arbitrary threshold - RED FLAG #6
for: 1m
annotations:
summary: "HTTP requests are high"
- alert: DatabaseConnections
expr: database_connections > 50 # No scaling context - RED FLAG #7
for: 1m
annotations:
summary: "Database connections are high"
- alert: LogErrors
expr: rate(log_errors_total[5m]) > 0 # Alert on any error - RED FLAG #8
for: 30s
annotations:
summary: "Errors detected in logs"
- alert: ResponseTime
expr: http_response_time > 500ms # Same threshold for all endpoints - RED FLAG #9
for: 1m
annotations:
summary: "Response time is slow"
# Problems this creates:
# - 15,000+ alerts per day that developers learn to ignore
# - No correlation between alerts and actual user impact
# - Same priority for minor issues and critical outages
# - No context about normal vs abnormal behavior
# - No actionable information for resolving issues
# - Alert fatigue leads to missing real emergencies
The Solution: Contextual, Intelligent Alerting with User Impact Focus
// Advanced alerting system with context, correlation, and user impact assessment
import { CorrelationEngine } from "./alerting/correlation";
import { UserImpactCalculator } from "./alerting/impact";
export class IntelligentAlertingSystem {
private correlationEngine: CorrelationEngine;
private impactCalculator: UserImpactCalculator;
private alertSuppressionMap: Map<string, AlertSuppressionRule> = new Map();
private alertHistory: Map<string, AlertHistory[]> = new Map();
constructor() {
this.correlationEngine = new CorrelationEngine();
this.impactCalculator = new UserImpactCalculator();
this.setupIntelligentAlerts();
}
private setupIntelligentAlerts(): void {
this.registerBusinessTransactionAlerts();
this.registerUserExperienceAlerts();
this.registerInfrastructureAlerts();
this.registerAnomalyDetectionAlerts();
}
// Business transaction focused alerts
private registerBusinessTransactionAlerts(): void {
// Payment processing failure rate
this.createSmartAlert({
name: "payment_processing_failure_rate",
description:
"Payment processing failure rate is above acceptable threshold",
query:
'rate(business_transactions_total{transaction_type="payment.process",outcome="failure"}[5m]) / rate(business_transactions_total{transaction_type="payment.process"}[5m]) > 0.01',
severity: AlertSeverity.CRITICAL,
evaluation: {
for: "2m", // Wait 2 minutes to avoid false positives
evaluateEvery: "30s",
},
context: {
businessImpact: "Direct revenue loss from failed payments",
userImpact: "Users cannot complete purchases",
slaImpact: "99.9% payment success rate SLA breach",
runbook: "https://runbooks.company.com/payment-failures",
},
dynamicThreshold: {
baselineWindow: "1h",
deviationMultiplier: 3.0,
minimumSamples: 100,
},
suppressionRules: [
{
condition: "maintenance_window_active",
duration: "4h",
},
],
});
// User registration flow disruption
this.createSmartAlert({
name: "user_registration_flow_disruption",
description: "User registration success rate has dropped significantly",
query:
'rate(business_transactions_total{transaction_type="user.register",outcome="success"}[10m]) / rate(business_transactions_total{transaction_type="user.register"}[10m]) < 0.95',
severity: AlertSeverity.HIGH,
evaluation: {
for: "5m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Reduced new user acquisition",
userImpact: "New users cannot create accounts",
slaImpact: "95% registration success rate SLA breach",
},
correlationRules: [
"check_authentication_service_health",
"check_email_service_health",
"check_database_connectivity",
],
});
}
// User experience focused alerts
private registerUserExperienceAlerts(): void {
// Page load time degradation
this.createSmartAlert({
name: "user_experience_degradation",
description:
"User experience is degraded based on response time percentiles",
query:
"histogram_quantile(0.95, rate(user_experience_response_time_bucket[5m])) > 2.0",
severity: AlertSeverity.HIGH,
evaluation: {
for: "3m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Increased bounce rate and user dissatisfaction",
userImpact: "Slow page loads affecting user experience",
slaImpact: "2 second 95th percentile response time SLA breach",
},
smartThreshold: {
timeOfDay: {
peak_hours: { threshold: 2.0, multiplier: 1.0 },
off_hours: { threshold: 1.5, multiplier: 1.2 },
},
userTier: {
premium: { threshold: 1.5, priority: "high" },
free: { threshold: 3.0, priority: "medium" },
},
},
});
// Search functionality degradation
this.createSmartAlert({
name: "search_functionality_degradation",
description: "Search functionality is experiencing performance issues",
query:
'rate(business_transactions_total{transaction_type="search.query",outcome="success"}[5m]) < 0.98',
severity: AlertSeverity.MEDIUM,
evaluation: {
for: "3m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Reduced product discovery and conversion",
userImpact: "Users having difficulty finding products",
slaImpact: "98% search success rate SLA breach",
},
});
}
// Infrastructure alerts with business context
private registerInfrastructureAlerts(): void {
// Database connection pool exhaustion
this.createSmartAlert({
name: "database_connection_pool_exhaustion",
description: "Database connection pool is near exhaustion",
query: "database_connections_active / database_connections_max > 0.85",
severity: AlertSeverity.HIGH,
evaluation: {
for: "2m",
evaluateEvery: "30s",
},
context: {
businessImpact: "Potential for complete service outage",
userImpact: "Users may experience timeout errors",
technicalImpact: "Database query failures imminent",
runbook: "https://runbooks.company.com/database-connections",
},
escalation: {
levels: [
{ after: "5m", severity: AlertSeverity.CRITICAL },
{
after: "10m",
severity: AlertSeverity.CRITICAL,
notify: ["on-call-manager"],
},
],
},
});
// Memory leak detection
this.createSmartAlert({
name: "memory_leak_detection",
description: "Potential memory leak detected based on usage trends",
query: "increase(process_memory_usage_bytes[30m]) > 100000000", // 100MB increase in 30 minutes
severity: AlertSeverity.MEDIUM,
evaluation: {
for: "5m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Service instability and potential crashes",
technicalImpact: "Memory exhaustion will lead to OOM kills",
runbook: "https://runbooks.company.com/memory-leaks",
},
trendAnalysis: {
window: "2h",
projectedFailureTime: true,
},
});
}
// Anomaly detection based alerts
private registerAnomalyDetectionAlerts(): void {
// Traffic pattern anomaly
this.createSmartAlert({
name: "traffic_pattern_anomaly",
description: "Unusual traffic pattern detected",
query:
"abs(rate(http_requests_total[5m]) - avg_over_time(rate(http_requests_total[5m])[1h:5m])) / avg_over_time(rate(http_requests_total[5m])[1h:5m]) > 0.5",
severity: AlertSeverity.MEDIUM,
evaluation: {
for: "3m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Potential DDoS attack or viral content",
technicalImpact: "Infrastructure may be overwhelmed",
investigationSteps: [
"Check traffic sources and geographic distribution",
"Verify CDN and load balancer performance",
"Check for marketing campaigns or viral content",
],
},
anomalyDetection: {
algorithm: "seasonal_decomposition",
seasonality: "1d",
sensitivity: "medium",
},
});
}
// Smart alert creation with context and intelligence
private createSmartAlert(config: SmartAlertConfig): void {
const alert: SmartAlert = {
...config,
id: this.generateAlertId(config.name),
createdAt: new Date(),
evaluationHistory: [],
suppressionState: "active",
};
// Set up dynamic threshold if configured
    if (config.dynamicThreshold) {
      // calculateDynamicThreshold is async; set the threshold once the baseline is computed
      this.calculateDynamicThreshold(config.query, config.dynamicThreshold).then(
        (threshold) => {
          alert.currentThreshold = threshold;
        }
      );
    }
// Register alert with monitoring system
this.registerAlertRule(alert);
console.log(`Registered smart alert: ${config.name}`);
}
// Alert evaluation with context and correlation
async evaluateAlert(
alert: SmartAlert,
currentValue: number
): Promise<AlertEvaluationResult> {
const evaluationContext: AlertEvaluationContext = {
timestamp: new Date(),
value: currentValue,
businessContext: await this.getBusinessContext(),
systemContext: await this.getSystemContext(),
userImpactAssessment: await this.impactCalculator.calculateImpact(
alert,
currentValue
),
};
// Check suppression rules
const suppressionCheck = this.checkSuppressionRules(
alert,
evaluationContext
);
if (suppressionCheck.shouldSuppress) {
return {
shouldFire: false,
reason: suppressionCheck.reason,
context: evaluationContext,
};
}
// Check if threshold is breached
const thresholdBreach = this.evaluateThreshold(
alert,
currentValue,
evaluationContext
);
if (!thresholdBreach.isBreached) {
return {
shouldFire: false,
reason: "Threshold not breached",
context: evaluationContext,
};
}
// Perform correlation analysis
const correlatedEvents = await this.correlationEngine.findCorrelatedEvents(
alert,
evaluationContext
);
// Calculate alert priority based on context
const priority = this.calculateAlertPriority(
alert,
evaluationContext,
correlatedEvents
);
return {
shouldFire: true,
priority,
correlatedEvents,
context: evaluationContext,
      enrichedAlert: await this.enrichAlertWithContext(
alert,
evaluationContext,
correlatedEvents
),
};
}
// Enrich alert with actionable context
  private async enrichAlertWithContext(
    alert: SmartAlert,
    context: AlertEvaluationContext,
    correlatedEvents: CorrelatedEvent[]
  ): Promise<EnrichedAlert> {
return {
id: alert.id,
name: alert.name,
severity: alert.severity,
description: alert.description,
currentValue: context.value,
threshold: alert.currentThreshold,
businessImpact: {
description: alert.context.businessImpact,
estimatedRevenueLoss: this.estimateRevenueLoss(alert, context),
affectedUserCount: context.userImpactAssessment.affectedUsers,
slaBreachRisk: alert.context.slaImpact,
},
technicalContext: {
correlatedEvents,
systemHealth: context.systemContext,
recentChanges: await this.getRecentSystemChanges(),
        suggestedActions: await this.generateSuggestedActions(
alert,
correlatedEvents
),
},
investigationContext: {
runbookUrl: alert.context.runbook,
relatedDashboards: this.getRelatedDashboards(alert),
keyMetrics: await this.getKeyMetricsSnapshot(alert),
similarIncidents: await this.findSimilarHistoricalIncidents(alert),
},
notificationContext: {
urgency: this.calculateUrgency(alert, context),
escalationPath: alert.escalation,
suppressionRules: alert.suppressionRules,
notificationChannels: this.selectNotificationChannels(alert, context),
},
};
}
// Dynamic threshold calculation based on historical data
private async calculateDynamicThreshold(
query: string,
config: DynamicThresholdConfig
): Promise<number> {
// Get historical data for baseline
const historicalData = await this.queryHistoricalData(
query,
config.baselineWindow
);
if (historicalData.length < config.minimumSamples) {
return config.fallbackThreshold || 0;
}
// Calculate statistical threshold
const mean =
historicalData.reduce((sum, val) => sum + val, 0) / historicalData.length;
const stdDev = Math.sqrt(
historicalData.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) /
historicalData.length
);
return mean + stdDev * config.deviationMultiplier;
}
// Correlation engine for finding related events
private async findCorrelatedEvents(
alert: SmartAlert,
context: AlertEvaluationContext
): Promise<CorrelatedEvent[]> {
const timeWindow = 10; // 10 minutes
const correlatedEvents: CorrelatedEvent[] = [];
// Check for correlated alerts
const recentAlerts = await this.getRecentAlerts(timeWindow);
for (const recentAlert of recentAlerts) {
const correlation = this.calculateCorrelation(alert, recentAlert);
if (correlation.strength > 0.7) {
correlatedEvents.push({
type: "alert",
correlation,
event: recentAlert,
});
}
}
// Check for recent deployments
const recentDeployments = await this.getRecentDeployments(timeWindow);
for (const deployment of recentDeployments) {
correlatedEvents.push({
type: "deployment",
correlation: { strength: 0.8, type: "temporal" },
event: deployment,
});
}
// Check for infrastructure changes
const infraChanges = await this.getRecentInfrastructureChanges(timeWindow);
for (const change of infraChanges) {
correlatedEvents.push({
type: "infrastructure",
correlation: { strength: 0.6, type: "causal" },
event: change,
});
}
return correlatedEvents;
}
// Generate actionable suggestions based on alert type and context
  private async generateSuggestedActions(
    alert: SmartAlert,
    correlatedEvents: CorrelatedEvent[]
  ): Promise<string[]> {
const actions: string[] = [];
// Alert-specific actions
const alertTypeActions = this.getAlertTypeSpecificActions(alert);
actions.push(...alertTypeActions);
// Context-based actions
if (correlatedEvents.some((e) => e.type === "deployment")) {
actions.push("Consider rolling back recent deployment");
actions.push("Check deployment logs for errors");
}
if (correlatedEvents.some((e) => e.type === "infrastructure")) {
actions.push("Verify infrastructure changes are properly applied");
actions.push("Check for configuration drift");
}
// Historical incident actions
const similarIncidents = await this.findSimilarHistoricalIncidents(alert);
if (similarIncidents.length > 0) {
const commonResolutions = this.extractCommonResolutions(similarIncidents);
actions.push(...commonResolutions);
}
return [...new Set(actions)]; // Remove duplicates
}
// Supporting methods for business context
private async getBusinessContext(): Promise<BusinessContext> {
return {
currentPromotions: await this.getCurrentPromotions(),
peakTrafficPeriod: this.isPeakTrafficPeriod(),
maintenanceWindows: await this.getActiveMaintenanceWindows(),
criticalBusinessPeriods: this.isCriticalBusinessPeriod(),
};
}
private async getSystemContext(): Promise<SystemContext> {
return {
overallSystemHealth: await this.getOverallSystemHealth(),
recentDeployments: await this.getRecentDeployments(60), // 1 hour
activeIncidents: await this.getActiveIncidents(),
systemLoad: await this.getCurrentSystemLoad(),
};
}
private calculateUrgency(
alert: SmartAlert,
context: AlertEvaluationContext
): AlertUrgency {
let urgencyScore = 0;
// Base urgency from severity
switch (alert.severity) {
case AlertSeverity.CRITICAL:
urgencyScore += 40;
break;
case AlertSeverity.HIGH:
urgencyScore += 30;
break;
case AlertSeverity.MEDIUM:
urgencyScore += 20;
break;
case AlertSeverity.LOW:
urgencyScore += 10;
break;
}
// Business context multipliers
if (context.businessContext.peakTrafficPeriod) urgencyScore *= 1.5;
if (context.businessContext.criticalBusinessPeriods) urgencyScore *= 2.0;
// User impact multipliers
if (context.userImpactAssessment.affectedUsers > 10000) urgencyScore *= 1.8;
if (context.userImpactAssessment.revenueImpact > 10000) urgencyScore *= 2.2;
if (urgencyScore > 80) return AlertUrgency.IMMEDIATE;
if (urgencyScore > 60) return AlertUrgency.HIGH;
if (urgencyScore > 40) return AlertUrgency.MEDIUM;
return AlertUrgency.LOW;
}
private generateAlertId(name: string): string {
return `alert_${name}_${Date.now()}_${Math.random()
.toString(36)
.substr(2, 6)}`;
}
// Supporting placeholder methods (would be implemented with actual data sources)
private async queryHistoricalData(
query: string,
window: string
): Promise<number[]> {
return [];
}
private async getRecentAlerts(minutes: number): Promise<any[]> {
return [];
}
private async getRecentDeployments(minutes: number): Promise<any[]> {
return [];
}
private async getRecentInfrastructureChanges(
minutes: number
): Promise<any[]> {
return [];
}
private async getCurrentPromotions(): Promise<string[]> {
return [];
}
private isPeakTrafficPeriod(): boolean {
return false;
}
private async getActiveMaintenanceWindows(): Promise<any[]> {
return [];
}
private isCriticalBusinessPeriod(): boolean {
return false;
}
private async getOverallSystemHealth(): Promise<any> {
return {};
}
private async getActiveIncidents(): Promise<any[]> {
return [];
}
private async getCurrentSystemLoad(): Promise<any> {
return {};
}
private async getRecentSystemChanges(): Promise<any[]> {
return [];
}
private getRelatedDashboards(alert: SmartAlert): string[] {
return [];
}
private async getKeyMetricsSnapshot(alert: SmartAlert): Promise<any> {
return {};
}
private async findSimilarHistoricalIncidents(
alert: SmartAlert
): Promise<any[]> {
return [];
}
private selectNotificationChannels(
alert: SmartAlert,
context: AlertEvaluationContext
): string[] {
return [];
}
private estimateRevenueLoss(
alert: SmartAlert,
context: AlertEvaluationContext
): number {
return 0;
}
private calculateCorrelation(alert1: SmartAlert, alert2: any): any {
return { strength: 0, type: "none" };
}
private getAlertTypeSpecificActions(alert: SmartAlert): string[] {
return [];
}
private extractCommonResolutions(incidents: any[]): string[] {
return [];
}
private calculateAlertPriority(
alert: SmartAlert,
context: AlertEvaluationContext,
events: CorrelatedEvent[]
): number {
return 1;
}
private evaluateThreshold(
alert: SmartAlert,
value: number,
context: AlertEvaluationContext
): { isBreached: boolean } {
return { isBreached: true };
}
private checkSuppressionRules(
alert: SmartAlert,
context: AlertEvaluationContext
): { shouldSuppress: boolean; reason?: string } {
return { shouldSuppress: false };
}
private registerAlertRule(alert: SmartAlert): void {}
}
// Supporting interfaces and types
interface SmartAlertConfig {
name: string;
description: string;
query: string;
severity: AlertSeverity;
evaluation: {
for: string;
evaluateEvery: string;
};
context: {
businessImpact: string;
userImpact?: string;
slaImpact?: string;
technicalImpact?: string;
runbook?: string;
investigationSteps?: string[];
};
dynamicThreshold?: DynamicThresholdConfig;
smartThreshold?: SmartThresholdConfig;
suppressionRules?: AlertSuppressionRule[];
correlationRules?: string[];
escalation?: AlertEscalation;
trendAnalysis?: TrendAnalysisConfig;
anomalyDetection?: AnomalyDetectionConfig;
}
interface SmartAlert extends SmartAlertConfig {
id: string;
createdAt: Date;
evaluationHistory: any[];
suppressionState: string;
currentThreshold?: number;
}
enum AlertSeverity {
LOW = "low",
MEDIUM = "medium",
HIGH = "high",
CRITICAL = "critical",
}
enum AlertUrgency {
LOW = "low",
MEDIUM = "medium",
HIGH = "high",
IMMEDIATE = "immediate",
}
interface DynamicThresholdConfig {
baselineWindow: string;
deviationMultiplier: number;
minimumSamples: number;
fallbackThreshold?: number;
}
interface SmartThresholdConfig {
timeOfDay?: Record<string, { threshold: number; multiplier?: number }>;
userTier?: Record<string, { threshold: number; priority?: string }>;
}
interface AlertSuppressionRule {
condition: string;
duration: string;
}
interface AlertEscalation {
levels: Array<{
after: string;
severity: AlertSeverity;
notify?: string[];
}>;
}
interface TrendAnalysisConfig {
window: string;
projectedFailureTime: boolean;
}
interface AnomalyDetectionConfig {
algorithm: string;
seasonality: string;
sensitivity: string;
}
interface AlertEvaluationContext {
timestamp: Date;
value: number;
businessContext: BusinessContext;
systemContext: SystemContext;
userImpactAssessment: UserImpactAssessment;
}
interface BusinessContext {
currentPromotions: string[];
peakTrafficPeriod: boolean;
maintenanceWindows: any[];
criticalBusinessPeriods: boolean;
}
interface SystemContext {
overallSystemHealth: any;
recentDeployments: any[];
activeIncidents: any[];
systemLoad: any;
}
interface UserImpactAssessment {
affectedUsers: number;
revenueImpact: number;
serviceDegradation: string;
}
interface CorrelatedEvent {
type: "alert" | "deployment" | "infrastructure" | "business";
correlation: {
strength: number;
type: string;
};
event: any;
}
interface EnrichedAlert {
id: string;
name: string;
severity: AlertSeverity;
description: string;
currentValue: number;
threshold?: number;
businessImpact: {
description: string;
estimatedRevenueLoss: number;
affectedUserCount: number;
slaBreachRisk?: string;
};
technicalContext: {
correlatedEvents: CorrelatedEvent[];
systemHealth: any;
recentChanges: any[];
suggestedActions: string[];
};
investigationContext: {
runbookUrl?: string;
relatedDashboards: string[];
keyMetrics: any;
similarIncidents: any[];
};
notificationContext: {
urgency: AlertUrgency;
escalationPath?: AlertEscalation;
suppressionRules?: AlertSuppressionRule[];
notificationChannels: string[];
};
}
interface AlertEvaluationResult {
shouldFire: boolean;
reason?: string;
priority?: number;
correlatedEvents?: CorrelatedEvent[];
context: AlertEvaluationContext;
enrichedAlert?: EnrichedAlert;
}
interface AlertHistory {
timestamp: Date;
value: number;
fired: boolean;
suppressed: boolean;
}
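As a rough usage sketch, the evaluation loop that feeds metric samples into the system might look like this (onMetricSample and notifyOnCall are hypothetical glue, not part of the original design):
// Minimal, illustrative wiring of the alerting system above
const alerting = new IntelligentAlertingSystem();

async function notifyOnCall(alert: EnrichedAlert): Promise<void> {
  // Hypothetical: route to a pager or chat channel based on alert.notificationContext
  console.log(`[${alert.severity}] ${alert.name}`, alert.businessImpact);
}

async function onMetricSample(alert: SmartAlert, value: number): Promise<void> {
  const result = await alerting.evaluateAlert(alert, value);
  if (result.shouldFire && result.enrichedAlert) {
    // The enriched alert carries business impact, correlated events, and suggested
    // actions, so the notification is actionable rather than a bare threshold breach
    await notifyOnCall(result.enrichedAlert);
  }
}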
This comprehensive backend monitoring and observability guide gives you:
- Strategic observability implementation focused on user experience and business outcomes rather than vanity metrics
- Advanced distributed tracing that provides end-to-end visibility across complex microservices architectures
- Intelligent alerting systems that reduce noise while providing actionable insights with business context
- Production-ready monitoring patterns that scale with system complexity and team growth
- Contextual incident response that helps teams resolve issues faster with correlated data and suggested actions
The difference between monitoring systems that prevent disasters and those that hide them isn’t just collecting more data—it’s understanding user impact, business context, and system behavior patterns to create actionable insights that help teams deliver reliable software experiences.