Backend Monitoring & Observability
The $12.5 Million “Everything is Fine” Monitoring Disaster That Broke the Internet
Picture this observability nightmare: A major streaming platform with 300 million subscribers launches their “next-generation” monitoring system just before the season finale of their most popular show. Their DevOps team, fresh from a Kubernetes conference, implements every trendy observability tool they could find: Prometheus, Grafana, Jaeger, ELK stack, and custom dashboards showing green checkmarks everywhere.
Sunday night, 100 million users try to watch the finale simultaneously. The monitoring dashboard shows everything is “operational” while users experience complete service outages.
The symptoms were catastrophically expensive:
- Revenue loss hit $12.5 million in 4 hours: Premium subscribers cancelled en masse after missing the season finale
- Customer service received 2.3 million complaints: Users couldn’t stream anything while metrics showed “99.9% uptime”
- Social media exploded with 847,000 angry posts: #StreamingDown trended worldwide while their status page showed “All Systems Operational”
- Database queries averaged 47 seconds: Critical user authentication was failing while CPU metrics looked normal
- CDN served 404s to 67% of requests: Video content wasn’t loading while bandwidth graphs showed healthy traffic
- Memory leaks consumed 94% of available RAM: Application servers were thrashing while memory alerts never fired
Here’s what their expensive monitoring post-mortem revealed:
- Vanity metrics instead of user impact: They monitored CPU and memory but ignored actual user experience and business transactions
- Alert fatigue from irrelevant notifications: 15,000+ alerts per day about non-critical issues, causing engineers to ignore all notifications
- No distributed tracing across services: They couldn’t follow a single user request across their 247 microservices
- Monitoring the monitoring tools: They spent more time fixing Grafana dashboards than actual application issues
- Missing business logic observability: They tracked infrastructure but had zero visibility into authentication failures, payment processing, or content delivery
- Reactive instead of predictive monitoring: Every alert fired after users were already impacted, not before problems occurred
The final damage:
- $12.5 million in lost revenue from cancellations and refunds during a single evening
- 67% drop in user trust scores as subscribers lost faith in service reliability
- 8 months of engineering time rebuilding their entire observability infrastructure from scratch
- Complete executive turnover in the engineering organization after the board intervention
- Regulatory investigation as the outage affected emergency broadcast capabilities
The brutal truth? They had comprehensive monitoring that monitored everything except what actually mattered to users and the business.
The Uncomfortable Truth About Monitoring and Observability
Here’s what separates monitoring systems that prevent disasters from those that hide them: True observability isn’t about collecting more data—it’s about understanding user impact, business outcomes, and system behavior patterns. The more metrics you collect without context, the less visibility you actually have into what matters.
Most developers approach monitoring like this:
- Install popular monitoring tools and collect every metric available
- Create beautiful dashboards that show technical metrics instead of user experience
- Set up alerts based on arbitrary thresholds without understanding user impact
- React to problems after users are already affected
- Focus on infrastructure metrics while ignoring business logic and user journeys
But developers who build truly observable systems work differently:
- Monitor user experience first, infrastructure second by tracking real user interactions and business transactions
- Implement distributed tracing to understand how requests flow through complex microservices architectures
- Use contextual alerting that correlates multiple signals to reduce noise and focus on actual problems
- Build predictive monitoring that identifies problems before users are impacted
- Instrument business logic to understand not just if systems are running, but if they’re delivering value
The difference isn’t just knowing when problems occur—it’s understanding why they happen and preventing them from affecting users in the first place.
Ready to build monitoring systems that actually help you deliver reliable software instead of creating a false sense of security? Let’s dive into observability patterns that work in production.
Foundation of Effective Observability: The Three Pillars Done Right
The Problem: Monitoring Everything While Seeing Nothing
// The observability nightmare that creates more problems than it solves
const express = require("express");
const prometheus = require("prom-client");
const app = express();
// Collecting every metric imaginable - RED FLAG #1
const httpRequestsTotal = new prometheus.Counter({
name: "http_requests_total",
help: "Total number of HTTP requests",
labelNames: [
"method",
"route",
"status_code",
"user_agent",
"ip",
"referrer",
],
});
const httpRequestDuration = new prometheus.Histogram({
name: "http_request_duration_seconds",
help: "HTTP request duration in seconds",
buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 1, 2, 5, 10, 30], // Too many buckets - RED FLAG #2
labelNames: ["method", "route", "status_code"],
});
const memoryUsage = new prometheus.Gauge({
  name: "process_memory_usage_bytes",
  help: "Process memory usage",
  labelNames: ["type"],
  collect() {
    const usage = process.memoryUsage();
    this.set({ type: "heapUsed" }, usage.heapUsed);
    this.set({ type: "rss" }, usage.rss);
    this.set({ type: "heapTotal" }, usage.heapTotal);
    this.set({ type: "external" }, usage.external);
  },
});
// Logging everything without structure - RED FLAG #3
app.use((req, res, next) => {
console.log(`${new Date().toISOString()} ${req.method} ${req.url}`);
console.log("Headers:", req.headers);
console.log("Body:", req.body);
console.log("Query:", req.query);
console.log("User Agent:", req.get("User-Agent"));
console.log("IP:", req.ip);
// Increment metrics for every request - RED FLAG #4
httpRequestsTotal.inc({
method: req.method,
route: req.route?.path || "unknown",
status_code: res.statusCode,
user_agent: req.get("User-Agent"),
ip: req.ip,
referrer: req.get("Referrer") || "none",
});
next();
});
// Business logic with no observability - RED FLAG #5
app.post("/api/orders", async (req, res) => {
try {
// No tracing of business operations
const user = await User.findById(req.body.userId);
const product = await Product.findById(req.body.productId);
const inventory = await checkInventory(product.id);
const payment = await processPayment(req.body.paymentInfo);
const order = await createOrder(user, product, payment);
res.json({ success: true, orderId: order.id });
} catch (error) {
// Generic error logging - RED FLAG #6
console.error("Order creation failed:", error.message);
res.status(500).json({ error: "Internal server error" });
}
});
// Alert on everything - RED FLAG #7
const alertRules = [
"cpu_usage > 50%", // Too sensitive
"memory_usage > 60%", // No context
"disk_usage > 70%", // Arbitrary threshold
"http_requests_total > 100", // No time window context
"error_count > 0", // Alert fatigue guaranteed
];
// Problems this creates:
// - High-cardinality metrics explode storage costs and query times
// - Logs are unstructured and impossible to search effectively
// - No connection between metrics, traces, and logs
// - Alerts fire constantly for non-critical issues
// - No understanding of user impact or business outcomes
// - Missing context about why problems occur
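To make the high-cardinality problem concrete, here is a rough back-of-the-envelope series count for http_requests_total as labelled above (every per-label count below is an illustrative assumption, not a measurement):
// Rough cardinality estimate for the label set used above
const methods = 5;
const routes = 200;
const statusCodes = 20;
const userAgents = 5_000; // effectively unbounded in real traffic
const clientIps = 100_000; // also unbounded
// Prometheus stores one time series per unique label combination
const potentialSeries = methods * routes * statusCodes * userAgents * clientIps;
console.log(potentialSeries.toLocaleString()); // 10,000,000,000,000 potential series
Even with generous rounding, two unbounded labels turn a single counter into trillions of potential series, which is why the solution below restricts labels to a handful of low-cardinality dimensions.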
The Solution: Strategic Observability with User-Centric Monitoring
// Production-ready observability with proper instrumentation
import express from "express";
import { Counter, Histogram, Gauge, register } from "prom-client";
import { trace, context, SpanStatusCode } from "@opentelemetry/api";
import { createLogger, format, transports } from "winston";
import { correlationId } from "./middleware/correlation";
// Strategic metric collection focused on user experience
export class ObservabilityEngine {
private static instance: ObservabilityEngine;
// Business-focused metrics (RED method: Rate, Errors, Duration)
private readonly businessTransactionRate = new Counter({
name: "business_transactions_total",
help: "Total business transactions by type and outcome",
labelNames: ["transaction_type", "outcome", "customer_segment"],
});
private readonly businessTransactionDuration = new Histogram({
name: "business_transaction_duration_seconds",
help: "Business transaction duration by type",
buckets: [0.1, 0.3, 0.5, 1, 2, 5, 10], // Focused buckets based on SLA
labelNames: ["transaction_type", "customer_segment"],
});
private readonly userExperienceMetrics = new Histogram({
name: "user_experience_response_time",
help: "User-perceived response time",
buckets: [0.1, 0.2, 0.5, 1, 2, 3], // Based on user experience research
labelNames: ["endpoint", "user_tier"],
});
// Infrastructure metrics (USE method: Utilization, Saturation, Errors)
private readonly systemUtilization = new Gauge({
name: "system_utilization_ratio",
help: "System resource utilization",
labelNames: ["resource_type", "instance"],
});
private readonly systemSaturation = new Gauge({
name: "system_saturation_ratio",
help: "System resource saturation",
labelNames: ["resource_type", "instance"],
});
// Structured logging with correlation
private readonly logger = createLogger({
format: format.combine(
format.timestamp(),
format.errors({ stack: true }),
format.json(),
format((info) => {
// Add correlation ID to all logs
info.correlationId = correlationId.getStore()?.correlationId;
info.traceId = trace.getActiveSpan()?.spanContext().traceId;
info.spanId = trace.getActiveSpan()?.spanContext().spanId;
return info;
})()
),
transports: [
new transports.Console(),
new transports.File({
filename: "logs/error.log",
level: "error",
maxsize: 50 * 1024 * 1024, // 50MB
maxFiles: 10,
}),
new transports.File({
filename: "logs/business.log",
level: "info",
maxsize: 100 * 1024 * 1024, // 100MB
maxFiles: 20,
}),
],
});
static getInstance(): ObservabilityEngine {
if (!ObservabilityEngine.instance) {
ObservabilityEngine.instance = new ObservabilityEngine();
}
return ObservabilityEngine.instance;
}
// Business transaction instrumentation
instrumentBusinessTransaction<T>(
transactionType: string,
customerSegment: string,
operation: () => Promise<T>
): Promise<T> {
return this.withBusinessSpan(transactionType, async (span) => {
const timer = this.businessTransactionDuration.startTimer({
transaction_type: transactionType,
customer_segment: customerSegment,
});
try {
span.setAttributes({
"business.transaction.type": transactionType,
"business.customer.segment": customerSegment,
"business.transaction.id":
correlationId.getStore()?.correlationId || "unknown",
});
const result = await operation();
// Record successful business transaction
this.businessTransactionRate.inc({
transaction_type: transactionType,
outcome: "success",
customer_segment: customerSegment,
});
span.setStatus({ code: SpanStatusCode.OK });
this.logger.info("Business transaction completed", {
transactionType,
customerSegment,
outcome: "success",
duration: timer(),
});
return result;
} catch (error: any) {
// Record failed business transaction with context
this.businessTransactionRate.inc({
transaction_type: transactionType,
outcome: "failure",
customer_segment: customerSegment,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
this.logger.error("Business transaction failed", {
transactionType,
customerSegment,
error: error.message,
errorCode: error.code,
stack: error.stack,
duration: timer(),
});
throw error;
}
});
}
// User experience monitoring
trackUserExperience(
endpoint: string,
userTier: string,
responseTime: number,
success: boolean
): void {
this.userExperienceMetrics.observe(
{ endpoint, user_tier: userTier },
responseTime / 1000 // Convert to seconds
);
this.logger.info("User experience tracked", {
endpoint,
userTier,
responseTime,
success,
experienceGrade: this.calculateExperienceGrade(responseTime),
});
}
// System health monitoring
updateSystemMetrics(): void {
const usage = process.memoryUsage();
const cpuUsage = process.cpuUsage();
// Memory utilization
this.systemUtilization.set(
{ resource_type: "memory", instance: process.env.HOSTNAME || "unknown" },
usage.heapUsed / usage.heapTotal
);
// Memory saturation (approaching limits)
this.systemSaturation.set(
{ resource_type: "memory", instance: process.env.HOSTNAME || "unknown" },
usage.heapUsed / (usage.heapTotal * 0.8) // Alert at 80% of heap
);
this.logger.debug("System metrics updated", {
memory: {
heapUsed: usage.heapUsed,
heapTotal: usage.heapTotal,
utilization: usage.heapUsed / usage.heapTotal,
},
});
}
// Distributed tracing helpers
private async withBusinessSpan<T>(
name: string,
operation: (span: any) => Promise<T>
): Promise<T> {
const tracer = trace.getTracer("business-operations");
return tracer.startActiveSpan(name, async (span) => {
try {
return await operation(span);
} finally {
span.end();
}
});
}
// Alert context enrichment
enrichAlertContext(alertData: AlertData): EnrichedAlert {
    const currentCorrelationId = correlationId.getStore()?.correlationId;
const activeSpan = trace.getActiveSpan();
return {
...alertData,
context: {
        correlationId: currentCorrelationId,
traceId: activeSpan?.spanContext().traceId,
spanId: activeSpan?.spanContext().spanId,
businessContext: this.getBusinessContext(),
userImpact: this.calculateUserImpact(alertData),
suggestedActions: this.getSuggestedActions(alertData),
},
};
}
private calculateExperienceGrade(responseTime: number): string {
if (responseTime < 200) return "excellent";
if (responseTime < 500) return "good";
if (responseTime < 1000) return "fair";
if (responseTime < 2000) return "poor";
return "unacceptable";
}
private getBusinessContext(): any {
// Would integrate with business context provider
return {
currentPromotions: ["black-friday-sale"],
peakHours: this.isDuringPeakHours(),
maintenanceWindows: [],
};
}
private calculateUserImpact(alertData: AlertData): UserImpactAssessment {
// Sophisticated user impact calculation based on alert type and context
const impactFactors = {
"payment-processing-error": {
severity: "critical",
affectedUsers: "all-paying",
},
"authentication-failure": {
severity: "high",
affectedUsers: "all-users",
},
"search-slowdown": { severity: "medium", affectedUsers: "search-users" },
"recommendation-error": {
severity: "low",
affectedUsers: "browsing-users",
},
};
return (
impactFactors[alertData.type] || {
severity: "unknown",
affectedUsers: "unknown",
}
);
}
private getSuggestedActions(alertData: AlertData): string[] {
// Runbook automation suggestions
const actionMap: Record<string, string[]> = {
"high-cpu": [
"Check for memory leaks in recent deployments",
"Scale horizontally if sustained high load",
"Review slow queries in the last hour",
],
"payment-errors": [
"Check payment provider status",
"Verify payment service database connections",
"Review recent payment service deployments",
"Activate backup payment processor if needed",
],
"database-slow": [
"Check for long-running queries",
"Verify index utilization",
"Check database connection pool status",
"Consider read replica failover",
],
};
return (
actionMap[alertData.type] || ["Check system logs and recent changes"]
);
}
private isDuringPeakHours(): boolean {
const hour = new Date().getHours();
return hour >= 18 && hour <= 23; // 6 PM to 11 PM
}
}
// Comprehensive application instrumentation
export class ApplicationInstrumentation {
private observability: ObservabilityEngine;
constructor() {
this.observability = ObservabilityEngine.getInstance();
this.setupSystemMetricsCollection();
}
// Express middleware for request instrumentation
requestInstrumentation() {
return async (
req: express.Request,
res: express.Response,
next: express.NextFunction
) => {
const startTime = Date.now();
      const corrId =
        (req.headers["x-correlation-id"] as string) || this.generateCorrelationId();
      // Set correlation context for the entire request
      correlationId.run({ correlationId: corrId }, async () => {
const tracer = trace.getTracer("http-requests");
await tracer.startActiveSpan(
`${req.method} ${req.path}`,
async (span) => {
span.setAttributes({
"http.method": req.method,
"http.url": req.url,
"http.user_agent": req.get("User-Agent") || "",
"user.tier": req.user?.tier || "anonymous",
"request.correlation_id": correlationId,
});
try {
res.on("finish", () => {
const duration = Date.now() - startTime;
const userTier = req.user?.tier || "anonymous";
span.setAttributes({
"http.status_code": res.statusCode,
"http.response_size": res.get("content-length") || 0,
});
// Track user experience
this.observability.trackUserExperience(
req.path,
userTier,
duration,
res.statusCode < 400
);
if (res.statusCode >= 400) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: `HTTP ${res.statusCode}`,
});
}
span.end();
});
next();
} catch (error: any) {
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
next(error);
}
}
);
});
};
}
// Database operation instrumentation
instrumentDatabaseOperation<T>(
operation: string,
table: string,
query: () => Promise<T>
): Promise<T> {
return this.observability.instrumentBusinessTransaction(
`database.${operation}`,
"system",
async () => {
const tracer = trace.getTracer("database");
return tracer.startActiveSpan(
`db.${operation}.${table}`,
async (span) => {
span.setAttributes({
"db.operation": operation,
"db.table": table,
"db.type": "postgresql",
});
try {
const result = await query();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
);
}
// External service call instrumentation
instrumentExternalCall<T>(
serviceName: string,
operation: string,
call: () => Promise<T>
): Promise<T> {
return this.observability.instrumentBusinessTransaction(
`external.${serviceName}.${operation}`,
"system",
async () => {
const tracer = trace.getTracer("external-services");
return tracer.startActiveSpan(
`external.${serviceName}.${operation}`,
async (span) => {
span.setAttributes({
"service.name": serviceName,
"service.operation": operation,
"service.type": "external",
});
const startTime = Date.now();
try {
const result = await call();
const duration = Date.now() - startTime;
span.setAttributes({
"service.duration_ms": duration,
"service.success": true,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
span.setAttributes({
"service.duration_ms": duration,
"service.success": false,
"service.error": error.message,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
);
}
private setupSystemMetricsCollection(): void {
// Collect system metrics every 15 seconds
setInterval(() => {
this.observability.updateSystemMetrics();
}, 15000);
}
private generateCorrelationId(): string {
return `req_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
}
}
// Business-specific monitoring
export class BusinessMonitoring {
private observability: ObservabilityEngine;
constructor() {
this.observability = ObservabilityEngine.getInstance();
}
// E-commerce specific monitoring
async monitorOrderProcessing(orderData: OrderData): Promise<void> {
await this.observability.instrumentBusinessTransaction(
"order.create",
orderData.customerSegment,
async () => {
const tracer = trace.getTracer("business.orders");
await tracer.startActiveSpan("process.order", async (span) => {
span.setAttributes({
"order.id": orderData.id,
"order.value": orderData.totalValue,
"order.items_count": orderData.items.length,
"customer.segment": orderData.customerSegment,
"customer.tier": orderData.customerTier,
});
// Track business-specific metrics
this.trackBusinessMetrics("order_created", {
value: orderData.totalValue,
customerSegment: orderData.customerSegment,
itemsCount: orderData.items.length,
});
span.end();
});
}
);
}
// Payment processing monitoring
async monitorPaymentProcessing(paymentData: PaymentData): Promise<void> {
await this.observability.instrumentBusinessTransaction(
"payment.process",
paymentData.customerSegment,
async () => {
const tracer = trace.getTracer("business.payments");
await tracer.startActiveSpan("process.payment", async (span) => {
span.setAttributes({
"payment.method": paymentData.method,
"payment.amount": paymentData.amount,
"payment.currency": paymentData.currency,
"payment.processor": paymentData.processor,
});
this.trackBusinessMetrics("payment_processed", {
amount: paymentData.amount,
method: paymentData.method,
processor: paymentData.processor,
});
span.end();
});
}
);
}
private trackBusinessMetrics(eventType: string, data: any): void {
const logger = createLogger({ transports: [new transports.Console()] });
logger.info("Business event tracked", {
eventType,
...data,
timestamp: new Date().toISOString(),
});
}
}
// Supporting interfaces
interface AlertData {
type: string;
severity: "low" | "medium" | "high" | "critical";
message: string;
timestamp: Date;
source: string;
}
interface EnrichedAlert extends AlertData {
context: {
correlationId?: string;
traceId?: string;
spanId?: string;
businessContext: any;
userImpact: UserImpactAssessment;
suggestedActions: string[];
};
}
interface UserImpactAssessment {
severity: "low" | "medium" | "high" | "critical" | "unknown";
affectedUsers: string;
}
interface OrderData {
id: string;
totalValue: number;
items: any[];
customerSegment: string;
customerTier: string;
}
interface PaymentData {
method: string;
amount: number;
currency: string;
processor: string;
customerSegment: string;
}
Distributed Tracing: Understanding Complex System Interactions
The Problem: Black Box Microservices with No Visibility
// The distributed tracing nightmare - no correlation between services
// Service A: User Authentication Service
app.post("/auth/login", async (req, res) => {
try {
// No trace correlation - RED FLAG #1
console.log("Login attempt started");
const user = await userService.validateCredentials(
req.body.email,
req.body.password
);
// Calling external service with no tracing - RED FLAG #2
const permissions = await fetch(
"http://permission-service/api/permissions/" + user.id
);
// No context propagation - RED FLAG #3
const token = await tokenService.generateToken(user.id);
console.log("Login successful");
res.json({ token, user: user.id });
} catch (error) {
// No trace correlation in error - RED FLAG #4
console.error("Login failed:", error.message);
res.status(401).json({ error: "Authentication failed" });
}
});
// Service B: Permission Service
app.get("/api/permissions/:userId", async (req, res) => {
try {
// No incoming trace context - RED FLAG #5
console.log("Fetching permissions for user:", req.params.userId);
// Database call with no tracing - RED FLAG #6
const permissions = await db.query(
"SELECT * FROM permissions WHERE user_id = ?",
[req.params.userId]
);
// Calling another service - RED FLAG #7
const roleData = await fetch(
"http://role-service/api/roles/" + permissions.roleId
);
res.json(permissions);
} catch (error) {
console.error("Permission fetch failed:", error.message);
res.status(500).json({ error: "Internal server error" });
}
});
// Service C: Token Service
app.post("/api/tokens", async (req, res) => {
try {
// No correlation with original request - RED FLAG #8
console.log("Generating token for user:", req.body.userId);
// Redis call with no tracing - RED FLAG #9
const existingToken = await redis.get("token:" + req.body.userId);
if (existingToken) {
return res.json({ token: existingToken });
}
// JWT generation with no span context - RED FLAG #10
const token = jwt.sign({ userId: req.body.userId }, process.env.JWT_SECRET);
await redis.setex("token:" + req.body.userId, 3600, token);
res.json({ token });
} catch (error) {
console.error("Token generation failed:", error.message);
res.status(500).json({ error: "Token generation failed" });
}
});
// Problems this creates:
// - No way to trace a user request across multiple services
// - Cannot correlate logs from different services for the same user action
// - No understanding of which service is causing slowdowns
// - Impossible to debug complex interaction failures
// - No visibility into service dependencies and bottlenecks
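Before looking at the fix, it helps to see what context propagation actually puts on the wire. OpenTelemetry uses the W3C Trace Context format by default, so every outbound call carries a traceparent header; the sketch below shows the header shape and a manual injection (the helper name is illustrative, and the auto-instrumentation in the solution normally does this for you):
// W3C traceparent header carried on every inter-service hop:
//   version - trace-id (16 bytes, hex) - parent span-id (8 bytes, hex) - flags
//   traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
import { context, propagation } from "@opentelemetry/api";

async function callPermissionService(userId: string): Promise<unknown> {
  const headers: Record<string, string> = {};
  // Copies traceparent/tracestate from the active context into the headers object
  propagation.inject(context.active(), headers);
  const response = await fetch(
    `http://permission-service/api/permissions/${userId}`,
    { headers }
  );
  return response.json();
}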
The Solution: Comprehensive Distributed Tracing with Context Propagation
// Production-ready distributed tracing across microservices
import {
trace,
context,
propagation,
SpanStatusCode,
SpanKind,
} from "@opentelemetry/api";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { JaegerExporter } from "@opentelemetry/exporter-jaeger";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { HttpInstrumentation } from "@opentelemetry/instrumentation-http";
import { ExpressInstrumentation } from "@opentelemetry/instrumentation-express";
import { RedisInstrumentation } from "@opentelemetry/instrumentation-redis";
import axios from "axios";
// Comprehensive tracing setup
export class DistributedTracingManager {
private static instance: DistributedTracingManager;
private tracer: any;
static getInstance(): DistributedTracingManager {
if (!DistributedTracingManager.instance) {
DistributedTracingManager.instance = new DistributedTracingManager();
}
return DistributedTracingManager.instance;
}
constructor() {
this.initializeTracing();
}
private initializeTracing(): void {
// Configure tracing with proper service identification
    const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]:
process.env.SERVICE_NAME || "unknown-service",
[SemanticResourceAttributes.SERVICE_VERSION]:
process.env.SERVICE_VERSION || "1.0.0",
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
process.env.NODE_ENV || "development",
}),
instrumentations: [
new HttpInstrumentation({
// Enhance HTTP spans with business context
requestHook: (span, request) => {
span.setAttributes({
"http.request.body_size": request.headers["content-length"] || 0,
"http.request.correlation_id":
request.headers["x-correlation-id"] || "none",
});
},
responseHook: (span, response) => {
span.setAttributes({
"http.response.body_size":
response.headers["content-length"] || 0,
});
},
}),
new ExpressInstrumentation({
// Add business context to Express spans
          requestHook: (span, info) => {
            const req = info.request as any; // ExpressRequestInfo exposes the request as `request`
            span.setAttributes({
              "express.route": info.route,
              "user.id": req.user?.id || "anonymous",
              "user.tier": req.user?.tier || "free",
            });
          },
}),
new RedisInstrumentation({
// Add Redis operation context
responseHook: (span, cmdName, cmdArgs) => {
span.setAttributes({
"redis.key_pattern": this.extractKeyPattern(cmdArgs[0]),
});
},
}),
],
      // Configure Jaeger exporter for trace collection
      traceExporter: new JaegerExporter({
        endpoint:
          process.env.JAEGER_ENDPOINT || "http://localhost:14268/api/traces",
        tags: [
          {
            key: "service.environment",
            value: process.env.NODE_ENV || "development",
          },
          {
            key: "service.datacenter",
            value: process.env.DATACENTER || "unknown",
          },
        ],
      }),
    });
    sdk.start();
this.tracer = trace.getTracer("business-operations");
}
// Create a traced HTTP client with automatic context propagation
createTracedHttpClient(): TracedHttpClient {
return new TracedHttpClient();
}
// Business operation tracing with automatic context propagation
async traceBusinessOperation<T>(
operationName: string,
operationData: OperationContext,
operation: () => Promise<T>
): Promise<T> {
return this.tracer.startActiveSpan(
operationName,
{
kind: SpanKind.INTERNAL,
attributes: {
"business.operation.type": operationData.type,
"business.operation.id": operationData.id,
"business.user.id": operationData.userId,
"business.user.tier": operationData.userTier,
"business.correlation.id": operationData.correlationId,
},
},
async (span: any) => {
try {
const result = await operation();
span.setAttributes({
"business.operation.success": true,
"business.operation.result_size": this.estimateObjectSize(result),
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
span.setAttributes({
"business.operation.success": false,
"business.operation.error_type": error.constructor.name,
"business.operation.error_code": error.code || "unknown",
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
// Database operation tracing with performance metrics
async traceDatabaseOperation<T>(
operation: string,
table: string,
query: string,
params: any[],
executor: () => Promise<T>
): Promise<T> {
return this.tracer.startActiveSpan(
`db.${operation}`,
{
kind: SpanKind.CLIENT,
attributes: {
"db.system": "postgresql",
"db.operation": operation,
"db.table": table,
"db.statement": this.sanitizeQuery(query),
"db.parameters_count": params.length,
},
},
async (span: any) => {
const startTime = Date.now();
try {
const result = await executor();
const duration = Date.now() - startTime;
span.setAttributes({
"db.duration_ms": duration,
"db.rows_affected": Array.isArray(result) ? result.length : 1,
"db.success": true,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
span.setAttributes({
"db.duration_ms": duration,
"db.success": false,
"db.error_code": error.code,
"db.error_message": error.message,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
// External service call tracing with retry and circuit breaker context
async traceExternalCall<T>(
serviceName: string,
operation: string,
url: string,
method: string,
call: () => Promise<T>,
retryCount: number = 0
): Promise<T> {
return this.tracer.startActiveSpan(
`external.${serviceName}.${operation}`,
{
kind: SpanKind.CLIENT,
attributes: {
"http.method": method,
"http.url": url,
"service.name": serviceName,
"service.operation": operation,
"service.retry_count": retryCount,
},
},
async (span: any) => {
const startTime = Date.now();
try {
const result = await call();
const duration = Date.now() - startTime;
span.setAttributes({
"http.response.duration_ms": duration,
"http.response.success": true,
"service.response_size": this.estimateObjectSize(result),
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error: any) {
const duration = Date.now() - startTime;
span.setAttributes({
"http.response.duration_ms": duration,
"http.response.success": false,
"http.response.status_code": error.response?.status || 0,
"service.error_type": error.constructor.name,
});
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
);
}
// Extract correlation context from incoming requests
extractCorrelationContext(
headers: Record<string, string>
): CorrelationContext {
// Extract OpenTelemetry context from headers
const parentContext = propagation.extract(context.active(), headers);
const activeSpan = trace.getSpan(parentContext);
return {
traceId: activeSpan?.spanContext().traceId || "unknown",
spanId: activeSpan?.spanContext().spanId || "unknown",
correlationId:
headers["x-correlation-id"] || this.generateCorrelationId(),
userId: headers["x-user-id"],
userTier: headers["x-user-tier"],
};
}
// Inject correlation context into outgoing requests
injectCorrelationContext(
headers: Record<string, string> = {}
): Record<string, string> {
const activeSpan = trace.getActiveSpan();
// Inject OpenTelemetry context
propagation.inject(context.active(), headers);
// Add custom correlation headers
if (activeSpan) {
headers["x-correlation-id"] = activeSpan.spanContext().traceId;
}
return headers;
}
private extractKeyPattern(key: string): string {
// Extract Redis key patterns for better observability
return key.replace(/\d+/g, "*").replace(/[a-f0-9-]{8,}/g, "*");
}
private sanitizeQuery(query: string): string {
// Remove sensitive data from query for logging
return query.replace(/'\w+'/g, "'***'").substring(0, 200);
}
private estimateObjectSize(obj: any): number {
try {
return JSON.stringify(obj).length;
} catch {
return 0;
}
}
private generateCorrelationId(): string {
return `trace_${Date.now()}_${Math.random().toString(36).substr(2, 9)}`;
}
}
// Traced HTTP client with automatic context propagation
export class TracedHttpClient {
private tracing: DistributedTracingManager;
constructor() {
this.tracing = DistributedTracingManager.getInstance();
}
async get<T>(url: string, options: RequestOptions = {}): Promise<T> {
return this.tracing.traceExternalCall(
this.extractServiceName(url),
"GET",
url,
"GET",
async () => {
const headers = this.tracing.injectCorrelationContext(
options.headers || {}
);
const response = await axios.get(url, { ...options, headers });
return response.data;
},
options.retryCount
);
}
async post<T>(
url: string,
data: any,
options: RequestOptions = {}
): Promise<T> {
return this.tracing.traceExternalCall(
this.extractServiceName(url),
"POST",
url,
"POST",
async () => {
const headers = this.tracing.injectCorrelationContext(
options.headers || {}
);
const response = await axios.post(url, data, { ...options, headers });
return response.data;
},
options.retryCount
);
}
private extractServiceName(url: string): string {
try {
const urlObj = new URL(url);
return urlObj.hostname.split(".")[0]; // Extract service name from hostname
} catch {
return "unknown-service";
}
}
}
// Complete microservice instrumentation example
export class MicroserviceInstrumentation {
private tracing: DistributedTracingManager;
private httpClient: TracedHttpClient;
constructor() {
this.tracing = DistributedTracingManager.getInstance();
this.httpClient = this.tracing.createTracedHttpClient();
}
// Authentication service with complete tracing
async authenticateUser(
credentials: UserCredentials,
headers: Record<string, string>
): Promise<AuthResult> {
const correlationContext = this.tracing.extractCorrelationContext(headers);
return this.tracing.traceBusinessOperation(
"user.authenticate",
{
type: "authentication",
id: correlationContext.correlationId,
userId: credentials.email,
userTier: "unknown",
correlationId: correlationContext.correlationId,
},
async () => {
// Step 1: Validate credentials with database tracing
const user = await this.tracing.traceDatabaseOperation(
"SELECT",
"users",
"SELECT id, email, password_hash, tier FROM users WHERE email = ?",
[credentials.email],
async () => {
return await db.query(
"SELECT id, email, password_hash, tier FROM users WHERE email = ?",
[credentials.email]
);
}
);
if (
!user ||
!(await bcrypt.compare(credentials.password, user.password_hash))
) {
throw new AuthenticationError("Invalid credentials");
}
// Step 2: Fetch permissions from external service
const permissions = await this.httpClient.get<UserPermissions>(
`${process.env.PERMISSION_SERVICE_URL}/api/permissions/${user.id}`,
{
headers: this.tracing.injectCorrelationContext({}),
retryCount: 0,
}
);
// Step 3: Generate token with external service call
const tokenData = await this.httpClient.post<TokenResponse>(
`${process.env.TOKEN_SERVICE_URL}/api/tokens`,
{ userId: user.id, permissions: permissions.roles },
{
headers: this.tracing.injectCorrelationContext({}),
retryCount: 1,
}
);
return {
token: tokenData.token,
user: {
id: user.id,
email: user.email,
tier: user.tier,
},
permissions: permissions.roles,
};
}
);
}
// Permission service with tracing
async getUserPermissions(
userId: string,
headers: Record<string, string>
): Promise<UserPermissions> {
const correlationContext = this.tracing.extractCorrelationContext(headers);
return this.tracing.traceBusinessOperation(
"permissions.fetch",
{
type: "authorization",
id: correlationContext.correlationId,
userId,
userTier: correlationContext.userTier || "unknown",
correlationId: correlationContext.correlationId,
},
async () => {
// Fetch user permissions with database tracing
const userRoles = await this.tracing.traceDatabaseOperation(
"SELECT",
"user_roles",
"SELECT role_id FROM user_roles WHERE user_id = ?",
[userId],
async () => {
return await db.query(
"SELECT role_id FROM user_roles WHERE user_id = ?",
[userId]
);
}
);
// Fetch role details from cache or database
const roles = await Promise.all(
userRoles.map((userRole: any) =>
this.tracing.traceDatabaseOperation(
"SELECT",
"roles",
"SELECT name, permissions FROM roles WHERE id = ?",
[userRole.role_id],
async () => {
return await db.query(
"SELECT name, permissions FROM roles WHERE id = ?",
[userRole.role_id]
);
}
)
)
);
return {
userId,
roles: roles.map((role) => ({
name: role.name,
permissions: JSON.parse(role.permissions),
})),
};
}
);
}
// Token service with caching and tracing
async generateToken(
tokenRequest: TokenRequest,
headers: Record<string, string>
): Promise<TokenResponse> {
const correlationContext = this.tracing.extractCorrelationContext(headers);
return this.tracing.traceBusinessOperation(
"token.generate",
{
type: "token_generation",
id: correlationContext.correlationId,
userId: tokenRequest.userId,
userTier: correlationContext.userTier || "unknown",
correlationId: correlationContext.correlationId,
},
async () => {
// Check for existing token in Redis with tracing
const existingToken = await this.tracing.traceExternalCall(
"redis",
"GET",
"redis://cache/tokens",
"GET",
async () => {
return await redis.get(`token:${tokenRequest.userId}`);
}
);
if (existingToken) {
return { token: existingToken, expiresIn: 3600 };
}
// Generate new token
const tokenData = {
userId: tokenRequest.userId,
permissions: tokenRequest.permissions,
iat: Math.floor(Date.now() / 1000),
exp: Math.floor(Date.now() / 1000) + 3600, // 1 hour
};
const token = jwt.sign(tokenData, process.env.JWT_SECRET!);
// Cache token with tracing
await this.tracing.traceExternalCall(
"redis",
"SETEX",
"redis://cache/tokens",
"SETEX",
async () => {
await redis.setex(`token:${tokenRequest.userId}`, 3600, token);
}
);
return { token, expiresIn: 3600 };
}
);
}
}
// Supporting interfaces
interface OperationContext {
type: string;
id: string;
userId: string;
userTier: string;
correlationId: string;
}
interface CorrelationContext {
traceId: string;
spanId: string;
correlationId: string;
userId?: string;
userTier?: string;
}
interface RequestOptions {
headers?: Record<string, string>;
timeout?: number;
retryCount?: number;
}
interface UserCredentials {
email: string;
password: string;
}
interface AuthResult {
token: string;
user: {
id: string;
email: string;
tier: string;
};
permissions: string[];
}
interface UserPermissions {
userId: string;
roles: Array<{
name: string;
permissions: string[];
}>;
}
interface TokenRequest {
userId: string;
permissions: string[];
}
interface TokenResponse {
token: string;
expiresIn: number;
}
class AuthenticationError extends Error {
constructor(message: string) {
super(message);
this.name = "AuthenticationError";
}
}
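To close the loop on the login flow from the problem example, a downstream permission-service route could reuse the incoming headers so its spans join the same trace (the Express app and error handling here are illustrative):
// Hypothetical permission-service route built on the instrumentation above
const instrumentation = new MicroserviceInstrumentation();

app.get("/api/permissions/:userId", async (req, res) => {
  try {
    // getUserPermissions extracts the propagated trace context from the headers,
    // so this request appears as a child of the authentication service's trace
    const permissions = await instrumentation.getUserPermissions(
      req.params.userId,
      req.headers as Record<string, string>
    );
    res.json(permissions);
  } catch (error) {
    res.status(500).json({ error: "Failed to fetch permissions" });
  }
});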
Intelligent Alerting: From Noise to Actionable Insights
The Problem: Alert Fatigue and Meaningless Notifications
# The alerting nightmare that trains everyone to ignore critical issues
groups:
- name: everything_is_an_emergency
rules:
# Alert on everything - RED FLAG #1
- alert: CPUUsageHigh
expr: cpu_usage_percent > 50 # Way too sensitive - RED FLAG #2
for: 1m
annotations:
summary: "CPU usage is high" # Meaningless message - RED FLAG #3
- alert: MemoryUsageHigh
expr: memory_usage_percent > 60 # No context - RED FLAG #4
for: 30s # Too quick to fire - RED FLAG #5
annotations:
summary: "Memory usage is high"
- alert: DiskUsageHigh
expr: disk_usage_percent > 70
for: 1m
annotations:
summary: "Disk usage is high"
- alert: HTTPRequests
expr: rate(http_requests_total[5m]) > 100 # Arbitrary threshold - RED FLAG #6
for: 1m
annotations:
summary: "HTTP requests are high"
- alert: DatabaseConnections
expr: database_connections > 50 # No scaling context - RED FLAG #7
for: 1m
annotations:
summary: "Database connections are high"
- alert: LogErrors
expr: rate(log_errors_total[5m]) > 0 # Alert on any error - RED FLAG #8
for: 30s
annotations:
summary: "Errors detected in logs"
- alert: ResponseTime
expr: http_response_time > 500ms # Same threshold for all endpoints - RED FLAG #9
for: 1m
annotations:
summary: "Response time is slow"
# Problems this creates:
# - 15,000+ alerts per day that developers learn to ignore
# - No correlation between alerts and actual user impact
# - Same priority for minor issues and critical outages
# - No context about normal vs abnormal behavior
# - No actionable information for resolving issues
# - Alert fatigue leads to missing real emergencies
The Solution: Contextual, Intelligent Alerting with User Impact Focus
// Advanced alerting system with context, correlation, and user impact assessment
import { CorrelationEngine } from "./alerting/correlation";
import { UserImpactCalculator } from "./alerting/impact";
export class IntelligentAlertingSystem {
private correlationEngine: CorrelationEngine;
private impactCalculator: UserImpactCalculator;
private alertSuppressionMap: Map<string, AlertSuppressionRule> = new Map();
private alertHistory: Map<string, AlertHistory[]> = new Map();
constructor() {
this.correlationEngine = new CorrelationEngine();
this.impactCalculator = new UserImpactCalculator();
this.setupIntelligentAlerts();
}
private setupIntelligentAlerts(): void {
this.registerBusinessTransactionAlerts();
this.registerUserExperienceAlerts();
this.registerInfrastructureAlerts();
this.registerAnomalyDetectionAlerts();
}
// Business transaction focused alerts
private registerBusinessTransactionAlerts(): void {
// Payment processing failure rate
this.createSmartAlert({
name: "payment_processing_failure_rate",
description:
"Payment processing failure rate is above acceptable threshold",
query:
'rate(business_transactions_total{transaction_type="payment.process",outcome="failure"}[5m]) / rate(business_transactions_total{transaction_type="payment.process"}[5m]) > 0.01',
severity: AlertSeverity.CRITICAL,
evaluation: {
for: "2m", // Wait 2 minutes to avoid false positives
evaluateEvery: "30s",
},
context: {
businessImpact: "Direct revenue loss from failed payments",
userImpact: "Users cannot complete purchases",
slaImpact: "99.9% payment success rate SLA breach",
runbook: "https://runbooks.company.com/payment-failures",
},
dynamicThreshold: {
baselineWindow: "1h",
deviationMultiplier: 3.0,
minimumSamples: 100,
},
suppressionRules: [
{
condition: "maintenance_window_active",
duration: "4h",
},
],
});
// User registration flow disruption
this.createSmartAlert({
name: "user_registration_flow_disruption",
description: "User registration success rate has dropped significantly",
query:
'rate(business_transactions_total{transaction_type="user.register",outcome="success"}[10m]) / rate(business_transactions_total{transaction_type="user.register"}[10m]) < 0.95',
severity: AlertSeverity.HIGH,
evaluation: {
for: "5m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Reduced new user acquisition",
userImpact: "New users cannot create accounts",
slaImpact: "95% registration success rate SLA breach",
},
correlationRules: [
"check_authentication_service_health",
"check_email_service_health",
"check_database_connectivity",
],
});
}
// User experience focused alerts
private registerUserExperienceAlerts(): void {
// Page load time degradation
this.createSmartAlert({
name: "user_experience_degradation",
description:
"User experience is degraded based on response time percentiles",
query:
"histogram_quantile(0.95, rate(user_experience_response_time_bucket[5m])) > 2.0",
severity: AlertSeverity.HIGH,
evaluation: {
for: "3m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Increased bounce rate and user dissatisfaction",
userImpact: "Slow page loads affecting user experience",
slaImpact: "2 second 95th percentile response time SLA breach",
},
smartThreshold: {
timeOfDay: {
peak_hours: { threshold: 2.0, multiplier: 1.0 },
off_hours: { threshold: 1.5, multiplier: 1.2 },
},
userTier: {
premium: { threshold: 1.5, priority: "high" },
free: { threshold: 3.0, priority: "medium" },
},
},
});
// Search functionality degradation
this.createSmartAlert({
name: "search_functionality_degradation",
description: "Search functionality is experiencing performance issues",
query:
'rate(business_transactions_total{transaction_type="search.query",outcome="success"}[5m]) < 0.98',
severity: AlertSeverity.MEDIUM,
evaluation: {
for: "3m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Reduced product discovery and conversion",
userImpact: "Users having difficulty finding products",
slaImpact: "98% search success rate SLA breach",
},
});
}
// Infrastructure alerts with business context
private registerInfrastructureAlerts(): void {
// Database connection pool exhaustion
this.createSmartAlert({
name: "database_connection_pool_exhaustion",
description: "Database connection pool is near exhaustion",
query: "database_connections_active / database_connections_max > 0.85",
severity: AlertSeverity.HIGH,
evaluation: {
for: "2m",
evaluateEvery: "30s",
},
context: {
businessImpact: "Potential for complete service outage",
userImpact: "Users may experience timeout errors",
technicalImpact: "Database query failures imminent",
runbook: "https://runbooks.company.com/database-connections",
},
escalation: {
levels: [
{ after: "5m", severity: AlertSeverity.CRITICAL },
{
after: "10m",
severity: AlertSeverity.CRITICAL,
notify: ["on-call-manager"],
},
],
},
});
// Memory leak detection
this.createSmartAlert({
name: "memory_leak_detection",
description: "Potential memory leak detected based on usage trends",
query: "increase(process_memory_usage_bytes[30m]) > 100000000", // 100MB increase in 30 minutes
severity: AlertSeverity.MEDIUM,
evaluation: {
for: "5m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Service instability and potential crashes",
technicalImpact: "Memory exhaustion will lead to OOM kills",
runbook: "https://runbooks.company.com/memory-leaks",
},
trendAnalysis: {
window: "2h",
projectedFailureTime: true,
},
});
}
// Anomaly detection based alerts
private registerAnomalyDetectionAlerts(): void {
// Traffic pattern anomaly
this.createSmartAlert({
name: "traffic_pattern_anomaly",
description: "Unusual traffic pattern detected",
query:
"abs(rate(http_requests_total[5m]) - avg_over_time(rate(http_requests_total[5m])[1h:5m])) / avg_over_time(rate(http_requests_total[5m])[1h:5m]) > 0.5",
severity: AlertSeverity.MEDIUM,
evaluation: {
for: "3m",
evaluateEvery: "1m",
},
context: {
businessImpact: "Potential DDoS attack or viral content",
technicalImpact: "Infrastructure may be overwhelmed",
investigationSteps: [
"Check traffic sources and geographic distribution",
"Verify CDN and load balancer performance",
"Check for marketing campaigns or viral content",
],
},
anomalyDetection: {
algorithm: "seasonal_decomposition",
seasonality: "1d",
sensitivity: "medium",
},
});
}
// Smart alert creation with context and intelligence
private createSmartAlert(config: SmartAlertConfig): void {
const alert: SmartAlert = {
...config,
id: this.generateAlertId(config.name),
createdAt: new Date(),
evaluationHistory: [],
suppressionState: "active",
};
// Set up dynamic threshold if configured
    if (config.dynamicThreshold) {
      // calculateDynamicThreshold is async; set the threshold once the baseline is computed
      this.calculateDynamicThreshold(config.query, config.dynamicThreshold).then(
        (threshold) => {
          alert.currentThreshold = threshold;
        }
      );
    }
// Register alert with monitoring system
this.registerAlertRule(alert);
console.log(`Registered smart alert: ${config.name}`);
}
// Alert evaluation with context and correlation
async evaluateAlert(
alert: SmartAlert,
currentValue: number
): Promise<AlertEvaluationResult> {
const evaluationContext: AlertEvaluationContext = {
timestamp: new Date(),
value: currentValue,
businessContext: await this.getBusinessContext(),
systemContext: await this.getSystemContext(),
userImpactAssessment: await this.impactCalculator.calculateImpact(
alert,
currentValue
),
};
// Check suppression rules
const suppressionCheck = this.checkSuppressionRules(
alert,
evaluationContext
);
if (suppressionCheck.shouldSuppress) {
return {
shouldFire: false,
reason: suppressionCheck.reason,
context: evaluationContext,
};
}
// Check if threshold is breached
const thresholdBreach = this.evaluateThreshold(
alert,
currentValue,
evaluationContext
);
if (!thresholdBreach.isBreached) {
return {
shouldFire: false,
reason: "Threshold not breached",
context: evaluationContext,
};
}
// Perform correlation analysis
const correlatedEvents = await this.correlationEngine.findCorrelatedEvents(
alert,
evaluationContext
);
// Calculate alert priority based on context
const priority = this.calculateAlertPriority(
alert,
evaluationContext,
correlatedEvents
);
return {
shouldFire: true,
priority,
correlatedEvents,
context: evaluationContext,
      enrichedAlert: await this.enrichAlertWithContext(
alert,
evaluationContext,
correlatedEvents
),
};
}
// Enrich alert with actionable context
  private async enrichAlertWithContext(
    alert: SmartAlert,
    context: AlertEvaluationContext,
    correlatedEvents: CorrelatedEvent[]
  ): Promise<EnrichedAlert> {
return {
id: alert.id,
name: alert.name,
severity: alert.severity,
description: alert.description,
currentValue: context.value,
threshold: alert.currentThreshold,
businessImpact: {
description: alert.context.businessImpact,
estimatedRevenueLoss: this.estimateRevenueLoss(alert, context),
affectedUserCount: context.userImpactAssessment.affectedUsers,
slaBreachRisk: alert.context.slaImpact,
},
technicalContext: {
correlatedEvents,
systemHealth: context.systemContext,
recentChanges: await this.getRecentSystemChanges(),
        suggestedActions: await this.generateSuggestedActions(
alert,
correlatedEvents
),
},
investigationContext: {
runbookUrl: alert.context.runbook,
relatedDashboards: this.getRelatedDashboards(alert),
keyMetrics: await this.getKeyMetricsSnapshot(alert),
similarIncidents: await this.findSimilarHistoricalIncidents(alert),
},
notificationContext: {
urgency: this.calculateUrgency(alert, context),
escalationPath: alert.escalation,
suppressionRules: alert.suppressionRules,
notificationChannels: this.selectNotificationChannels(alert, context),
},
};
}
// Dynamic threshold calculation based on historical data
private async calculateDynamicThreshold(
query: string,
config: DynamicThresholdConfig
): Promise<number> {
// Get historical data for baseline
const historicalData = await this.queryHistoricalData(
query,
config.baselineWindow
);
if (historicalData.length < config.minimumSamples) {
return config.fallbackThreshold || 0;
}
// Calculate statistical threshold
const mean =
historicalData.reduce((sum, val) => sum + val, 0) / historicalData.length;
const stdDev = Math.sqrt(
historicalData.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) /
historicalData.length
);
return mean + stdDev * config.deviationMultiplier;
}
// Correlation engine for finding related events
private async findCorrelatedEvents(
alert: SmartAlert,
context: AlertEvaluationContext
): Promise<CorrelatedEvent[]> {
const timeWindow = 10; // 10 minutes
const correlatedEvents: CorrelatedEvent[] = [];
// Check for correlated alerts
const recentAlerts = await this.getRecentAlerts(timeWindow);
for (const recentAlert of recentAlerts) {
const correlation = this.calculateCorrelation(alert, recentAlert);
if (correlation.strength > 0.7) {
correlatedEvents.push({
type: "alert",
correlation,
event: recentAlert,
});
}
}
// Check for recent deployments
const recentDeployments = await this.getRecentDeployments(timeWindow);
for (const deployment of recentDeployments) {
correlatedEvents.push({
type: "deployment",
correlation: { strength: 0.8, type: "temporal" },
event: deployment,
});
}
// Check for infrastructure changes
const infraChanges = await this.getRecentInfrastructureChanges(timeWindow);
for (const change of infraChanges) {
correlatedEvents.push({
type: "infrastructure",
correlation: { strength: 0.6, type: "causal" },
event: change,
});
}
return correlatedEvents;
}
// Generate actionable suggestions based on alert type and context
  private async generateSuggestedActions(
    alert: SmartAlert,
    correlatedEvents: CorrelatedEvent[]
  ): Promise<string[]> {
const actions: string[] = [];
// Alert-specific actions
const alertTypeActions = this.getAlertTypeSpecificActions(alert);
actions.push(...alertTypeActions);
// Context-based actions
if (correlatedEvents.some((e) => e.type === "deployment")) {
actions.push("Consider rolling back recent deployment");
actions.push("Check deployment logs for errors");
}
if (correlatedEvents.some((e) => e.type === "infrastructure")) {
actions.push("Verify infrastructure changes are properly applied");
actions.push("Check for configuration drift");
}
// Historical incident actions
const similarIncidents = await this.findSimilarHistoricalIncidents(alert);
if (similarIncidents.length > 0) {
const commonResolutions = this.extractCommonResolutions(similarIncidents);
actions.push(...commonResolutions);
}
return [...new Set(actions)]; // Remove duplicates
}
// Supporting methods for business context
private async getBusinessContext(): Promise<BusinessContext> {
return {
currentPromotions: await this.getCurrentPromotions(),
peakTrafficPeriod: this.isPeakTrafficPeriod(),
maintenanceWindows: await this.getActiveMaintenanceWindows(),
criticalBusinessPeriods: this.isCriticalBusinessPeriod(),
};
}
private async getSystemContext(): Promise<SystemContext> {
return {
overallSystemHealth: await this.getOverallSystemHealth(),
recentDeployments: await this.getRecentDeployments(60), // 1 hour
activeIncidents: await this.getActiveIncidents(),
systemLoad: await this.getCurrentSystemLoad(),
};
}
private calculateUrgency(
alert: SmartAlert,
context: AlertEvaluationContext
): AlertUrgency {
let urgencyScore = 0;
// Base urgency from severity
switch (alert.severity) {
case AlertSeverity.CRITICAL:
urgencyScore += 40;
break;
case AlertSeverity.HIGH:
urgencyScore += 30;
break;
case AlertSeverity.MEDIUM:
urgencyScore += 20;
break;
case AlertSeverity.LOW:
urgencyScore += 10;
break;
}
// Business context multipliers
if (context.businessContext.peakTrafficPeriod) urgencyScore *= 1.5;
if (context.businessContext.criticalBusinessPeriods) urgencyScore *= 2.0;
// User impact multipliers
if (context.userImpactAssessment.affectedUsers > 10000) urgencyScore *= 1.8;
if (context.userImpactAssessment.revenueImpact > 10000) urgencyScore *= 2.2;
if (urgencyScore > 80) return AlertUrgency.IMMEDIATE;
if (urgencyScore > 60) return AlertUrgency.HIGH;
if (urgencyScore > 40) return AlertUrgency.MEDIUM;
return AlertUrgency.LOW;
}
private generateAlertId(name: string): string {
return `alert_${name}_${Date.now()}_${Math.random()
.toString(36)
.substr(2, 6)}`;
}
// Supporting placeholder methods (would be implemented with actual data sources)
private async queryHistoricalData(
query: string,
window: string
): Promise<number[]> {
return [];
}
private async getRecentAlerts(minutes: number): Promise<any[]> {
return [];
}
private async getRecentDeployments(minutes: number): Promise<any[]> {
return [];
}
private async getRecentInfrastructureChanges(
minutes: number
): Promise<any[]> {
return [];
}
private async getCurrentPromotions(): Promise<string[]> {
return [];
}
private isPeakTrafficPeriod(): boolean {
return false;
}
private async getActiveMaintenanceWindows(): Promise<any[]> {
return [];
}
private isCriticalBusinessPeriod(): boolean {
return false;
}
private async getOverallSystemHealth(): Promise<any> {
return {};
}
private async getActiveIncidents(): Promise<any[]> {
return [];
}
private async getCurrentSystemLoad(): Promise<any> {
return {};
}
private async getRecentSystemChanges(): Promise<any[]> {
return [];
}
private getRelatedDashboards(alert: SmartAlert): string[] {
return [];
}
private async getKeyMetricsSnapshot(alert: SmartAlert): Promise<any> {
return {};
}
private async findSimilarHistoricalIncidents(
alert: SmartAlert
): Promise<any[]> {
return [];
}
private selectNotificationChannels(
alert: SmartAlert,
context: AlertEvaluationContext
): string[] {
return [];
}
private estimateRevenueLoss(
alert: SmartAlert,
context: AlertEvaluationContext
): number {
return 0;
}
private calculateCorrelation(alert1: SmartAlert, alert2: any): any {
return { strength: 0, type: "none" };
}
private getAlertTypeSpecificActions(alert: SmartAlert): string[] {
return [];
}
private extractCommonResolutions(incidents: any[]): string[] {
return [];
}
private calculateAlertPriority(
alert: SmartAlert,
context: AlertEvaluationContext,
events: CorrelatedEvent[]
): number {
return 1;
}
private evaluateThreshold(
alert: SmartAlert,
value: number,
context: AlertEvaluationContext
): { isBreached: boolean } {
return { isBreached: true };
}
private checkSuppressionRules(
alert: SmartAlert,
context: AlertEvaluationContext
): { shouldSuppress: boolean; reason?: string } {
return { shouldSuppress: false };
}
private registerAlertRule(alert: SmartAlert): void {}
}
// Supporting interfaces and types
interface SmartAlertConfig {
name: string;
description: string;
query: string;
severity: AlertSeverity;
evaluation: {
for: string;
evaluateEvery: string;
};
context: {
businessImpact: string;
userImpact?: string;
slaImpact?: string;
technicalImpact?: string;
runbook?: string;
investigationSteps?: string[];
};
dynamicThreshold?: DynamicThresholdConfig;
smartThreshold?: SmartThresholdConfig;
suppressionRules?: AlertSuppressionRule[];
correlationRules?: string[];
escalation?: AlertEscalation;
trendAnalysis?: TrendAnalysisConfig;
anomalyDetection?: AnomalyDetectionConfig;
}
interface SmartAlert extends SmartAlertConfig {
id: string;
createdAt: Date;
evaluationHistory: any[];
suppressionState: string;
currentThreshold?: number;
}
enum AlertSeverity {
LOW = "low",
MEDIUM = "medium",
HIGH = "high",
CRITICAL = "critical",
}
enum AlertUrgency {
LOW = "low",
MEDIUM = "medium",
HIGH = "high",
IMMEDIATE = "immediate",
}
interface DynamicThresholdConfig {
baselineWindow: string;
deviationMultiplier: number;
minimumSamples: number;
fallbackThreshold?: number;
}
interface SmartThresholdConfig {
timeOfDay?: Record<string, { threshold: number; multiplier?: number }>;
userTier?: Record<string, { threshold: number; priority?: string }>;
}
interface AlertSuppressionRule {
condition: string;
duration: string;
}
interface AlertEscalation {
levels: Array<{
after: string;
severity: AlertSeverity;
notify?: string[];
}>;
}
interface TrendAnalysisConfig {
window: string;
projectedFailureTime: boolean;
}
interface AnomalyDetectionConfig {
algorithm: string;
seasonality: string;
sensitivity: string;
}
interface AlertEvaluationContext {
timestamp: Date;
value: number;
businessContext: BusinessContext;
systemContext: SystemContext;
userImpactAssessment: UserImpactAssessment;
}
interface BusinessContext {
currentPromotions: string[];
peakTrafficPeriod: boolean;
maintenanceWindows: any[];
criticalBusinessPeriods: boolean;
}
interface SystemContext {
overallSystemHealth: any;
recentDeployments: any[];
activeIncidents: any[];
systemLoad: any;
}
interface UserImpactAssessment {
affectedUsers: number;
revenueImpact: number;
serviceDegradation: string;
}
interface CorrelatedEvent {
type: "alert" | "deployment" | "infrastructure" | "business";
correlation: {
strength: number;
type: string;
};
event: any;
}
interface EnrichedAlert {
id: string;
name: string;
severity: AlertSeverity;
description: string;
currentValue: number;
threshold?: number;
businessImpact: {
description: string;
estimatedRevenueLoss: number;
affectedUserCount: number;
slaBreachRisk?: string;
};
technicalContext: {
correlatedEvents: CorrelatedEvent[];
systemHealth: any;
recentChanges: any[];
suggestedActions: string[];
};
investigationContext: {
runbookUrl?: string;
relatedDashboards: string[];
keyMetrics: any;
similarIncidents: any[];
};
notificationContext: {
urgency: AlertUrgency;
escalationPath?: AlertEscalation;
suppressionRules?: AlertSuppressionRule[];
notificationChannels: string[];
};
}
interface AlertEvaluationResult {
shouldFire: boolean;
reason?: string;
priority?: number;
correlatedEvents?: CorrelatedEvent[];
context: AlertEvaluationContext;
enrichedAlert?: EnrichedAlert;
}
interface AlertHistory {
timestamp: Date;
value: number;
fired: boolean;
suppressed: boolean;
}
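As a rough usage sketch, the evaluation loop that feeds metric samples into the system might look like this (onMetricSample and notifyOnCall are hypothetical glue, not part of the original design):
// Minimal, illustrative wiring of the alerting system above
const alerting = new IntelligentAlertingSystem();

async function notifyOnCall(alert: EnrichedAlert): Promise<void> {
  // Hypothetical: route to a pager or chat channel based on alert.notificationContext
  console.log(`[${alert.severity}] ${alert.name}`, alert.businessImpact);
}

async function onMetricSample(alert: SmartAlert, value: number): Promise<void> {
  const result = await alerting.evaluateAlert(alert, value);
  if (result.shouldFire && result.enrichedAlert) {
    // The enriched alert carries business impact, correlated events, and suggested
    // actions, so the notification is actionable rather than a bare threshold breach
    await notifyOnCall(result.enrichedAlert);
  }
}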
This comprehensive backend monitoring and observability guide gives you:
- Strategic observability implementation focused on user experience and business outcomes rather than vanity metrics
- Advanced distributed tracing that provides end-to-end visibility across complex microservices architectures
- Intelligent alerting systems that reduce noise while providing actionable insights with business context
- Production-ready monitoring patterns that scale with system complexity and team growth
- Contextual incident response that helps teams resolve issues faster with correlated data and suggested actions
The difference between monitoring systems that prevent disasters and those that hide them isn’t just collecting more data—it’s understanding user impact, business context, and system behavior patterns to create actionable insights that help teams deliver reliable software experiences.