
The Only Guide You'd Ever Need for Load Balancers - 5

Health Checking - Don’t Send Traffic to Dead Servers

Welcome back, sorry for taking so long, but life happens. If you’re coming from part 4, you just built a working load balancer. Actual code, actual traffic distribution, actual RR. Pretty cool, right?

But remember what happened when we killed one of the backend servers? 33% of our requests started failing. Our load balancer kept happily sending traffic to a dead server like nothing was wrong. That’s not just suboptimal, it’s embarrassing.

In this part, we’re going to fix that. We’re going to teach our load balancer to detect when servers die and stop sending them traffic. This is called health checking, because we…check their health? I don’t know how I could’ve worded that any better, haha.


Servers Die

Let me paint you a picture. It’s 2 AM. You’re a sleeping beauty. Your wingman dating app has three backend servers, all happily serving users looking for love.

Then Server 2 crashes. Maybe the disk filled up. Maybe a memory leak finally hit in your vibe-coded app. Doesn’t matter. Server 2 is dead.

Before crash:

Before server 2 crashed

After server 2 crashes:

After server 2 crashed

Without health checking, your load balancer has no way of knowing which servers are alive and which are dead. It just keeps rotating through the list, sending traffic to dead servers.


Types of Server Failures

Before we implement health checking, let’s understand what can actually go wrong. Servers don’t just “die” in one way. There are multiple failure modes, and they all suck differently.

1. Complete Server Down

The server machine is completely off. Power failure, hardware failure, someone unplugged it.

Complete server down image

How it manifests: Connection refused or timeout. The TCP handshake never completes.

2. Process Crash

The server machine is fine, but the application crashed. The OS is running, but your web server process isn’t.

Process crash image

How it manifests: Connection refused. The port isn’t listening.

3. Network Partition

The server is fine, the application is fine, but there’s a network problem between the load balancer and the server.

Network partition problem

How it manifests: Connection timeout. Packets are being dropped or lost.

4. Overloaded / Slow Server

This is the sneaky one. The server is technically “alive” but it’s so overwhelmed that it might as well be dead.

Overload problem

How it manifests: Connections succeed, but responses take forever or never come.

Why This Matters

Different failure types need different detection strategies:

Failure Type          Detection Method
Server down           TCP connection fails
Process crash         TCP connection refused
Network partition     TCP connection timeout
Overloaded server     Response timeout / slow response

A good health checking system needs to handle ALL of these.
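
Just to make this concrete, here’s a rough sketch of how these failure modes surface as Go errors when you try to connect. This assumes a Unix-like system, and classifyDialError is just a demo helper, not part of our load balancer:

package main

import (
    "errors"
    "fmt"
    "net"
    "syscall"
    "time"
)

// classifyDialError is a demo helper showing how the failure modes
// above show up on a dial attempt.
func classifyDialError(err error) string {
    if err == nil {
        return "connected (the port is at least listening)"
    }
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        // packets going nowhere: machine off, network partition, or badly overloaded
        return "timeout"
    }
    if errors.Is(err, syscall.ECONNREFUSED) {
        // the machine answered but nothing is listening on the port: process crash
        return "connection refused"
    }
    return "some other failure"
}

func main() {
    // nothing is listening on 8082 here, so this prints "connection refused"
    _, err := net.DialTimeout("tcp", "127.0.0.1:8082", 2*time.Second)
    fmt.Println(classifyDialError(err))
}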


Active vs Passive Health Checks

There are two approaches to detecting server failures. Both have their place.

Passive Health Checks

“Learn from real traffic”

With passive health checks, you don’t actively probe servers. Instead, you watch what happens when you send real user traffic.

Request 1 ──► Server 2 ──► Success ✓
Request 2 ──► Server 2 ──► Success ✓
Request 3 ──► Server 2 ──► FAILED ✗
Request 4 ──► Server 2 ──► FAILED ✗
Request 5 ──► Server 2 ──► FAILED ✗

"Hmm, 3 failures in a row..."
"Maybe Server 2 is dead?"
"Let me stop sending traffic to it"

Pros:

  • No extra network traffic

Cons:

  • Users experience the failures (that’s how you detect them)
  • Slow to detect (need multiple failures)

Active Health Checks

“Proactively probe servers”

With active health checks, you continuously send test requests to servers to check if they’re alive. Real user traffic is separate.

Health Checker (background process):

Every 5 seconds:
    ├── Probe Server 1 ──► Response ✓ (healthy)
    ├── Probe Server 2 ──► No response ✗ (unhealthy)
    └── Probe Server 3 ──► Response ✓ (healthy)

Real traffic only goes to healthy servers:
    User Request ──► Server 1 or Server 3 (not 2)

Pros:

  • Detects failures BEFORE users hit them
  • Fast detection (configurable interval)

Cons:

  • Extra network traffic

Which Should We Use?

Both.

Seriously.

  • Active health checks for proactive detection
  • Passive health checks as a backup (if active checks miss something)

For this blog, we’ll implement active health checks first. They give us the most bang for our buck and are what most people think of when they hear “health checking.”


Active Health Check Types

There are different ways to probe a server. Let’s start from the simplest.

1. TCP Health Check (Simplest)

Just try to establish a TCP connection. If it succeeds, the server is “alive.”

A simple TCP health check

What it checks:

  • Server machine is running
  • Port is listening
  • Network path is working

What it DOESN’T check:

  • Application is working correctly
  • Database connections are fine
  • Dependencies are healthy

When to use: When you just need basic “is the port open” checks.
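
If you want to see how small this is, here’s a sketch of a TCP check in Go (we won’t actually use this one; it only needs fmt, net, and time from the standard library):

// tcpHealthCheck is a sketch of the simplest possible probe: attempt the
// TCP handshake and hang up. It tells you the port is open, nothing more.
func tcpHealthCheck(host string, port int, timeout time.Duration) bool {
    conn, err := net.DialTimeout("tcp", fmt.Sprintf("%s:%d", host, port), timeout)
    if err != nil {
        return false // refused, timed out, unreachable...
    }
    conn.Close()
    return true
}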

2. HTTP Health Check (Most Common)

Send an actual HTTP request to a health endpoint. Check the response.

An HTTP health check

What it checks:

  • Everything TCP check does, PLUS
  • HTTP server is responding
  • App code is running
  • Dependencies are healthy

When to use: Most of the time.

3. Custom Protocol Health Checks

For non-HTTP services (databases, message queues, custom protocols), you might need custom health checks.

Examples:
- MySQL: Run "SELECT 1" query
- Redis: Send PING, expect PONG
- gRPC: Call Health.Check RPC
- Custom TCP: Send magic bytes, check response

When to use: When your backend isn’t HTTP based.
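
As an illustration, a Redis check can be done over a raw TCP connection, since Redis answers an inline PING with +PONG. We don’t use Redis anywhere in this series; this sketch just assumes a Redis instance at whatever address you pass in (needs net, strings, and time):

// redisHealthCheck is an illustration of a custom-protocol probe:
// send an inline "PING", expect a reply starting with "+PONG".
func redisHealthCheck(addr string, timeout time.Duration) bool {
    conn, err := net.DialTimeout("tcp", addr, timeout)
    if err != nil {
        return false
    }
    defer conn.Close()

    // one deadline covers both the write and the read
    conn.SetDeadline(time.Now().Add(timeout))

    if _, err := conn.Write([]byte("PING\r\n")); err != nil {
        return false
    }

    buf := make([]byte, 64)
    n, err := conn.Read(buf)
    if err != nil {
        return false
    }
    return strings.HasPrefix(string(buf[:n]), "+PONG")
}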

For our load balancer, we’ll implement HTTP health checks since our backend servers are HTTP servers.


Health Check Parameters

Before we write code, let’s understand the knobs we can turn.

Interval

How often do we check server health?

Interval = 5 seconds:

Time:  0s     5s     10s    15s    20s    25s
       │      │      │      │      │      │
       ▼      ▼      ▼      ▼      ▼      ▼
     Check  Check  Check  Check  Check  Check

  • Too frequent (100ms): Waste bandwidth, stress servers
  • Too infrequent (60s): Slow to detect failures
  • Sweet spot: Usually 5-10 seconds for most applications

Timeout

How long do we wait for a health check response before giving up?

Timeout = 3 seconds:

Health Checker                         Server
      │                                   │
      │──── Health check request ────────►│
      │                                   │
      │     ... waiting ...               │
      │     ... still waiting ...         │
      │     ... 3 seconds pass ...        │
      │                                   │
      │     TIMEOUT - server is UNHEALTHY │

  • Too short (100ms): False positives on slow responses
  • Too long (30s): Takes forever to detect failures
  • Sweet spot: Usually 2-5 seconds

Threshold (Consecutive Failures)

How many consecutive failures before we mark a server as unhealthy?

Threshold = 3 consecutive failures:

Check 1: FAIL   ─► Server still marked healthy (1/3 failures)
Check 2: FAIL   ─► Server still marked healthy (2/3 failures)
Check 3: FAIL   ─► Server marked UNHEALTHY (3/3 failures)

Why not mark unhealthy on first failure?

  • Network blips happen
  • Single request timeouts are normal
  • We want to avoid “flapping” (don’t worry, it’s explained in the diagram below)
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│   Without threshold (mark unhealthy on first failure):        │
│                                                               │
│   Time: ──────────────────────────────────────────────►       │
│                                                               │
│   Status: HEALTHY ─► UNHEALTHY ─► HEALTHY ─► UNHEALTHY ─►     │
│                 ↑           ↑           ↑           ↑         │
│              1 fail     1 success   1 fail      1 success     │
│                                                               │
│   This is called "flapping" - constant status changes         │
│   Very bad for routing stability                              │
│                                                               │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   With threshold = 3:                                         │
│                                                               │
│   Time: ──────────────────────────────────────────────►       │
│                                                               │
│   Status: HEALTHY ────────────────────► UNHEALTHY ──────►     │
│                                    ↑                          │
│                          3 consecutive failures               │
│                                                               │
│   Much more stable                                            │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Rise Count

How many consecutive successes before we mark a server as healthy again?

Rise = 2 consecutive successes:

Server was unhealthy, now recovering:

Check 1: SUCCESS ─► Still unhealthy (1/2 successes)
Check 2: SUCCESS ─► Marked HEALTHY again (2/2 successes)

This prevents a server that’s still recovering, and possibly still unstable, from immediately getting traffic again.

Our Configuration

For our implementation, we’ll use:

Parameter             Value         Why
Interval              5 seconds     Good balance
Timeout               3 seconds     Long enough for slow responses
Unhealthy Threshold   3 failures    Avoid flapping
Healthy Threshold     2 successes   Quick recovery

Implementation Time

Alright, enough theory. Let’s write the code.

Updating Our Backend Structure

First, we need to track health status for each backend:

type Backend struct {
    Host   string
    Port   int
    Alive  bool       // is this server healthy?
    mux    sync.RWMutex // for thread safe status updates
}

// getter
func (b *Backend) IsAlive() bool {
    b.mux.RLock()
    defer b.mux.RUnlock()
    return b.Alive
}

// setter
func (b *Backend) SetAlive(alive bool) {
    b.mux.Lock()
    defer b.mux.Unlock()
    b.Alive = alive
}

Notice we’re using sync.RWMutex instead of sync.Mutex. This allows multiple goroutines to read simultaneously (using RLock), while writes still take an exclusive lock. Since we read the Alive status far more often than we write it, RWMutex is the better fit here.

The Health Checker Structure

type HealthChecker struct {
    pool               *ServerPool
    checkInterval      time.Duration
    timeout            time.Duration
    unhealthyThreshold int
    healthyThreshold   int
    healthPath         string

    // track failures/successes per backend
    failureCounts map[string]int
    successCounts map[string]int
    countsMux     sync.Mutex
}

func NewHealthChecker(pool *ServerPool) *HealthChecker {
    return &HealthChecker{
        pool:               pool,
        checkInterval:      5 * time.Second,
        timeout:            3 * time.Second,
        unhealthyThreshold: 3,
        healthyThreshold:   2,
        healthPath:         "/health",
        failureCounts:      make(map[string]int),
        successCounts:      make(map[string]int),
    }
}

The Core Health Check Logic

Here’s the function that actually checks if a single server is healthy:

func (hc *HealthChecker) checkBackend(backend *Backend) bool {
    url := fmt.Sprintf("http://%s:%d%s", backend.Host, backend.Port, hc.healthPath)

    // create HTTP client with our timeout
    client := &http.Client{
        Timeout: hc.timeout,
    }

    resp, err := client.Get(url)
    if err != nil {
        // connection failed, timeout, or other error
        return false
    }
    defer resp.Body.Close()

    // we consider 2xx status codes as healthy
    return resp.StatusCode >= 200 && resp.StatusCode < 300
}

Simple, right? We make an HTTP GET request to /health and check if we get a 2xx response.
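
One optional refinement: if you fully read the response body before closing it, Go’s HTTP client can keep the underlying connection alive and reuse it for the next probe instead of opening a fresh one every interval. Here’s a sketch of a variant that does that (call it checkBackendReuse; it’s not in the final code, and it needs the io import, which our main.go already has):

// a variant of checkBackend that drains the response body so the
// keep-alive connection can be reused between probes (nice-to-have, not a must)
func (hc *HealthChecker) checkBackendReuse(backend *Backend) bool {
    url := fmt.Sprintf("http://%s:%d%s", backend.Host, backend.Port, hc.healthPath)

    client := &http.Client{Timeout: hc.timeout}

    resp, err := client.Get(url)
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    // read and throw away whatever /health returned, so the connection
    // goes back into the pool instead of being torn down
    io.Copy(io.Discard, resp.Body)

    return resp.StatusCode >= 200 && resp.StatusCode < 300
}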

Handling Consecutive Failures/Successes

Now the logic that tracks consecutive failures and updates server status:

func (hc *HealthChecker) processHealthCheckResult(backend *Backend, isHealthy bool) {
    hc.countsMux.Lock()
    defer hc.countsMux.Unlock()

    key := fmt.Sprintf("%s:%d", backend.Host, backend.Port)
    wasAlive := backend.IsAlive()

    if isHealthy {
        hc.failureCounts[key] = 0
        hc.successCounts[key]++

        // if server was down and we've hit the healthy threshold, bring it back
        if !wasAlive && hc.successCounts[key] >= hc.healthyThreshold {
            backend.SetAlive(true)
            log.Printf("[HEALTH] Server %s is now HEALTHY (after %d successful checks)",
                key, hc.successCounts[key])
            hc.successCounts[key] = 0
        }
    } else {
        hc.successCounts[key] = 0
        hc.failureCounts[key]++

        // if server was up and we've hit the unhealthy threshold, mark it down
        if wasAlive && hc.failureCounts[key] >= hc.unhealthyThreshold {
            backend.SetAlive(false)
            log.Printf("[HEALTH] Server %s is now UNHEALTHY (after %d failed checks)",
                key, hc.failureCounts[key])
            hc.failureCounts[key] = 0
        } else if wasAlive {
            log.Printf("[HEALTH] Server %s failed check (%d/%d)",
                key, hc.failureCounts[key], hc.unhealthyThreshold)
        }
    }
}

This is basically what’s happening:

Server starts HEALTHY:

Check 1: FAIL
  failureCounts["server1"] = 1
  Log: "Server server1 failed check (1/3)"
  Status: still HEALTHY

Check 2: FAIL
  failureCounts["server1"] = 2
  Log: "Server server1 failed check (2/3)"
  Status: still HEALTHY

Check 3: FAIL
  failureCounts["server1"] = 3
  Log: "Server server1 is now UNHEALTHY"
  Status: UNHEALTHY ← now we stop sending traffic

Check 4: SUCCESS
  successCounts["server1"] = 1
  failureCounts["server1"] = 0 (reset)
  Status: still UNHEALTHY

Check 5: SUCCESS
  successCounts["server1"] = 2
  Log: "Server server1 is now HEALTHY"
  Status: HEALTHY ← back in rotation :)

The Background Health Check Loop

This runs continuously, checking all servers at the configured interval:

func (hc *HealthChecker) Start() {
    log.Printf("[HEALTH] Starting health checker (interval: %v, timeout: %v)",
        hc.checkInterval, hc.timeout)

    ticker := time.NewTicker(hc.checkInterval)
    defer ticker.Stop()

    // initial check immediately
    hc.checkAllBackends()

    for range ticker.C {
        hc.checkAllBackends()
    }
}

func (hc *HealthChecker) checkAllBackends() {
    backends := hc.pool.GetAllBackends()

    for _, backend := range backends {
        go func(b *Backend) {
            isHealthy := hc.checkBackend(b)
            hc.processHealthCheckResult(b, isHealthy)
        }(backend)
    }
}

We check all backends in parallel using goroutines. No point waiting for one slow server to time out before checking others.

Health checker loop
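
One thing this loop can’t do is stop: it ticks forever, which is fine for this blog. If you ever need to shut the checker down cleanly (tests, config reloads), a stop channel does the trick. A sketch, where StartWithStop and the stop channel are additions, not part of the code above:

// StartWithStop is a sketch of a stoppable health check loop.
func (hc *HealthChecker) StartWithStop(stop <-chan struct{}) {
    ticker := time.NewTicker(hc.checkInterval)
    defer ticker.Stop()

    hc.checkAllBackends() // initial check immediately

    for {
        select {
        case <-ticker.C:
            hc.checkAllBackends()
        case <-stop:
            log.Printf("[HEALTH] Health checker stopped")
            return
        }
    }
}

You’d start it with go healthChecker.StartWithStop(stop) and later close(stop) to shut it down.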

Updating the Server Pool

We need to update GetNextBackend to skip unhealthy servers:

func (p *ServerPool) GetNextBackend() *Backend {
    p.mux.Lock()
    defer p.mux.Unlock()

    if len(p.backends) == 0 {
        return nil
    }

    // try to find a healthy backend
    // we'll try at most len(backends) times to avoid infinite loop
    for i := 0; i < len(p.backends); i++ {
        backend := p.backends[p.current]
        p.current = (p.current + 1) % len(p.backends)

        if backend.IsAlive() {
            return backend
        }
    }

    return nil
}

// need this for health checker
func (p *ServerPool) GetAllBackends() []*Backend {
    p.mux.RLock()
    defer p.mux.RUnlock()

    result := make([]*Backend, len(p.backends))
    for i := range p.backends {
        result[i] = p.backends[i]
    }
    return result
}

The key change: we now skip backends where IsAlive() returns false.

Before (no health checking):

Backends: [Server1, Server2(dead), Server3]
Current index: 0

GetNextBackend() → Server1, index becomes 1
GetNextBackend() → Server2 (DEAD! User gets error), index becomes 2
GetNextBackend() → Server3, index becomes 0
GetNextBackend() → Server1, index becomes 1
...

After (with health checking):

Backends: [Server1(alive), Server2(dead), Server3(alive)]
Current index: 0

GetNextBackend() → Server1 (alive), return it, index becomes 1
GetNextBackend() → Server2 (dead, skip), Server3 (alive), return it, index becomes 0
GetNextBackend() → Server1 (alive), return it, index becomes 1
...

Dead servers are automatically skipped
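
If you want to convince yourself without killing real servers, a quick test sketch (e.g. in a pool_test.go next to main.go, using only the standard testing package and the types we already have) does the job:

// a quick sanity check that dead backends are skipped in rotation
func TestGetNextBackendSkipsDead(t *testing.T) {
    pool := NewServerPool()
    pool.AddBackend("127.0.0.1", 8081)
    pool.AddBackend("127.0.0.1", 8082)
    pool.AddBackend("127.0.0.1", 8083)

    // simulate the health checker marking 8082 as dead
    pool.GetAllBackends()[1].SetAlive(false)

    for i := 0; i < 6; i++ {
        b := pool.GetNextBackend()
        if b == nil {
            t.Fatal("expected a healthy backend, got nil")
        }
        if b.Port == 8082 {
            t.Errorf("dead backend 8082 was returned on iteration %d", i)
        }
    }
}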

Updating AddBackend

New servers should start as healthy (we assume they’re good until something bad happens, innocent until proven guilty type of thing):

func (p *ServerPool) AddBackend(host string, port int) {
    backend := &Backend{
        Host:  host,
        Port:  port,
        Alive: true, // start as healthy
    }
    p.backends = append(p.backends, backend)
    log.Printf("[POOL] Added server: %s:%d", host, port)
}

The Complete Updated Code

Here’s our full main.go with health checking:

package main

import (
    "fmt"
    "io"
    "log"
    "net"
    "net/http"
    "sync"
    "time"
)

type Backend struct {
    Host  string
    Port  int
    Alive bool
    mux   sync.RWMutex
}

func (b *Backend) IsAlive() bool {
    b.mux.RLock()
    defer b.mux.RUnlock()
    return b.Alive
}

func (b *Backend) SetAlive(alive bool) {
    b.mux.Lock()
    defer b.mux.Unlock()
    b.Alive = alive
}

func (b *Backend) Address() string {
    return fmt.Sprintf("%s:%d", b.Host, b.Port)
}

type ServerPool struct {
    backends []*Backend
    current  int
    mux      sync.RWMutex
}

func NewServerPool() *ServerPool {
    return &ServerPool{
        backends: make([]*Backend, 0),
        current:  0,
    }
}

func (p *ServerPool) AddBackend(host string, port int) {
    p.mux.Lock()
    defer p.mux.Unlock()

    backend := &Backend{
        Host:  host,
        Port:  port,
        Alive: true,
    }
    p.backends = append(p.backends, backend)
    log.Printf("[POOL] Added server: %s:%d", host, port)
}

func (p *ServerPool) GetNextBackend() *Backend {
    p.mux.Lock()
    defer p.mux.Unlock()

    if len(p.backends) == 0 {
        return nil
    }

    for i := 0; i < len(p.backends); i++ {
        backend := p.backends[p.current]
        p.current = (p.current + 1) % len(p.backends)

        if backend.IsAlive() {
            return backend
        }
    }

    return nil
}

func (p *ServerPool) GetAllBackends() []*Backend {
    p.mux.RLock()
    defer p.mux.RUnlock()

    result := make([]*Backend, len(p.backends))
    copy(result, p.backends)
    return result
}

func (p *ServerPool) Size() int {
    p.mux.RLock()
    defer p.mux.RUnlock()
    return len(p.backends)
}

func (p *ServerPool) HealthyCount() int {
    p.mux.RLock()
    defer p.mux.RUnlock()

    count := 0
    for _, b := range p.backends {
        if b.IsAlive() {
            count++
        }
    }
    return count
}

type HealthChecker struct {
    pool               *ServerPool
    checkInterval      time.Duration
    timeout            time.Duration
    unhealthyThreshold int
    healthyThreshold   int
    healthPath         string

    failureCounts map[string]int
    successCounts map[string]int
    countsMux     sync.Mutex
}

func NewHealthChecker(pool *ServerPool) *HealthChecker {
    return &HealthChecker{
        pool:               pool,
        checkInterval:      5 * time.Second,
        timeout:            3 * time.Second,
        unhealthyThreshold: 3,
        healthyThreshold:   2,
        healthPath:         "/health",
        failureCounts:      make(map[string]int),
        successCounts:      make(map[string]int),
    }
}

func (hc *HealthChecker) checkBackend(backend *Backend) bool {
    url := fmt.Sprintf("http://%s:%d%s", backend.Host, backend.Port, hc.healthPath)

    client := &http.Client{
        Timeout: hc.timeout,
    }

    resp, err := client.Get(url)
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    return resp.StatusCode >= 200 && resp.StatusCode < 300
}

func (hc *HealthChecker) processHealthCheckResult(backend *Backend, isHealthy bool) {
    hc.countsMux.Lock()
    defer hc.countsMux.Unlock()

    key := backend.Address()
    wasAlive := backend.IsAlive()

    if isHealthy {
        hc.failureCounts[key] = 0
        hc.successCounts[key]++

        if !wasAlive && hc.successCounts[key] >= hc.healthyThreshold {
            backend.SetAlive(true)
            log.Printf("[HEALTH] Server %s is now HEALTHY", key)
            hc.successCounts[key] = 0
        }
    } else {
        hc.successCounts[key] = 0
        hc.failureCounts[key]++

        if wasAlive && hc.failureCounts[key] >= hc.unhealthyThreshold {
            backend.SetAlive(false)
            log.Printf("[HEALTH] Server %s is now UNHEALTHY", key)
            hc.failureCounts[key] = 0
        } else if wasAlive {
            log.Printf("[HEALTH] Server %s failed check (%d/%d)",
                key, hc.failureCounts[key], hc.unhealthyThreshold)
        }
    }
}

func (hc *HealthChecker) checkAllBackends() {
    backends := hc.pool.GetAllBackends()
    var wg sync.WaitGroup

    for _, backend := range backends {
        wg.Add(1)
        go func(b *Backend) {
            defer wg.Done()
            isHealthy := hc.checkBackend(b)
            hc.processHealthCheckResult(b, isHealthy)
        }(backend)
    }

    wg.Wait()
}

func (hc *HealthChecker) Start() {
    log.Printf("[HEALTH] Starting health checker (interval: %v, timeout: %v)",
        hc.checkInterval, hc.timeout)

    // initial check
    hc.checkAllBackends()

    ticker := time.NewTicker(hc.checkInterval)
    for range ticker.C {
        hc.checkAllBackends()
    }
}

type LoadBalancer struct {
    host       string
    port       int
    serverPool *ServerPool
}

func NewLoadBalancer(host string, port int, pool *ServerPool) *LoadBalancer {
    return &LoadBalancer{
        host:       host,
        port:       port,
        serverPool: pool,
    }
}

func (lb *LoadBalancer) Start() error {
    address := fmt.Sprintf("%s:%d", lb.host, lb.port)

    listener, err := net.Listen("tcp", address)
    if err != nil {
        return fmt.Errorf("failed to start listener: %v", err)
    }
    defer listener.Close()

    log.Printf("[LB] Load Balancer started on %s", address)
    log.Printf("[LB] Backend servers: %d", lb.serverPool.Size())

    for {
        conn, err := listener.Accept()
        if err != nil {
            log.Printf("[LB] Failed to accept connection: %v", err)
            continue
        }

        go lb.handleConnection(conn)
    }
}

func (lb *LoadBalancer) handleConnection(clientConn net.Conn) {
    defer clientConn.Close()

    backend := lb.serverPool.GetNextBackend()
    if backend == nil {
        log.Printf("[LB] No healthy backend servers available!")
        clientConn.Write([]byte("HTTP/1.1 503 Service Unavailable\r\n\r\nNo healthy backends"))
        return
    }

    backendAddress := backend.Address()
    log.Printf("[LB] Forwarding %s → %s", clientConn.RemoteAddr(), backendAddress)

    backendConn, err := net.Dial("tcp", backendAddress)
    if err != nil {
        log.Printf("[LB] Failed to connect to backend %s: %v", backendAddress, err)
        clientConn.Write([]byte("HTTP/1.1 502 Bad Gateway\r\n\r\nBackend connection failed"))
        return
    }
    defer backendConn.Close()

    lb.forwardTraffic(clientConn, backendConn)
}

func (lb *LoadBalancer) forwardTraffic(client, backend net.Conn) {
    var wg sync.WaitGroup
    wg.Add(2)

    go func() {
        defer wg.Done()
        io.Copy(backend, client)
    }()

    go func() {
        defer wg.Done()
        io.Copy(client, backend)
    }()

    wg.Wait()
}

func main() {
    pool := NewServerPool()
    pool.AddBackend("127.0.0.1", 8081)
    pool.AddBackend("127.0.0.1", 8082)
    pool.AddBackend("127.0.0.1", 8083)

    // start health checker in background
    healthChecker := NewHealthChecker(pool)
    go healthChecker.Start()

    lb := NewLoadBalancer("0.0.0.0", 8080, pool)
    if err := lb.Start(); err != nil {
        log.Fatalf("Load balancer failed: %v", err)
    }
}

Updating the Backend Server

Our backend servers need a /health endpoint. Let’s update backend/server.go:

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
    "strconv"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: go run server.go <port>")
        os.Exit(1)
    }

    port, err := strconv.Atoi(os.Args[1])
    if err != nil {
        log.Fatalf("Invalid port: %v", err)
    }

    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(`{"status": "healthy"}`))
    })

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        log.Printf("[SERVER %d] Handled request for %s", port, r.URL.Path)

        response := fmt.Sprintf(`
<!DOCTYPE html>
<html>
<head>
    <title>Backend Server %d</title>
</head>
<body>
    <h1>Backend Server %d</h1>
    <p>Request was handled by server on port %d</p>
    <p>Path: %s</p>
    <p>Method: %s</p>
</body>
</html>
`, port, port, port, r.URL.Path, r.Method)

        fmt.Fprint(w, response)
    })

    address := fmt.Sprintf(":%d", port)
    log.Printf("[SERVER] Backend server started on port %d", port)

    if err := http.ListenAndServe(address, nil); err != nil {
        log.Fatalf("Server failed: %v", err)
    }
}
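
Our /health endpoint just returns 200 unconditionally, which is fine for this series. In a real app you’d usually have it verify dependencies too, which is how HTTP checks catch “the app is up but the database isn’t.” A sketch of that idea, where checkDependency is a hypothetical function you’d write for your own stack (our demo servers have no real dependencies):

// registerHealth is a sketch of a "deeper" health endpoint that reports
// unhealthy when a dependency check fails.
func registerHealth(checkDependency func() error) {
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        if err := checkDependency(); err != nil {
            w.WriteHeader(http.StatusServiceUnavailable)
            w.Write([]byte(`{"status": "unhealthy"}`))
            return
        }
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(`{"status": "healthy"}`))
    })
}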

Testing Health Checking

Now let’s see our health checking in action.

Step 1: Start Backend Servers

# terminal 1
go run backend/server.go 8081

# terminal 2
go run backend/server.go 8082

# terminal 3
go run backend/server.go 8083

Step 2: Start the Load Balancer

# terminal 4
go run main.go

You should see:

[POOL] Added server: 127.0.0.1:8081
[POOL] Added server: 127.0.0.1:8082
[POOL] Added server: 127.0.0.1:8083
[HEALTH] Starting health checker (interval: 5s, timeout: 3s)
[LB] Load Balancer started on 0.0.0.0:8080
[LB] Backend servers: 3

Step 3: Verify Normal Operation

Make some requests:

for i in {1..6}; do curl -s http://localhost:8080 | grep "Backend Server"; done

Should see traffic distributed across all three servers.

Step 4: Kill a Server

Now the not-so-boring part. Kill server 8082:

# terminal 2
^C

Watch the load balancer logs:

[HEALTH] Server 127.0.0.1:8082 failed check (1/3)
[HEALTH] Server 127.0.0.1:8082 failed check (2/3)
[HEALTH] Server 127.0.0.1:8082 is now UNHEALTHY

After 3 failed checks, Server 8082 is marked unhealthy.

Step 5: Verify Traffic Routing

Now make more requests:

for i in {1..6}; do curl -s http://localhost:8080 | grep "Backend Server"; done

Output:

    <h1>Backend Server 8081</h1>
    <h1>Backend Server 8083</h1>
    <h1>Backend Server 8081</h1>
    <h1>Backend Server 8083</h1>
    <h1>Backend Server 8081</h1>
    <h1>Backend Server 8083</h1>

No more 8082!! The dead server is excluded from rotation.

Step 6: Bring the Server Back

Restart server 8082:

# terminal 2
go run backend/server.go 8082

Watch the load balancer logs:

[HEALTH] Server 127.0.0.1:8082 is now HEALTHY

After 2 successful health checks (about 10 seconds), Server 8082 is back in rotation.

Step 7: Verify Recovery

for i in {1..6}; do curl -s http://localhost:8080 | grep "Backend Server"; done

Output:

    <h1>Backend Server 8081</h1>
    <h1>Backend Server 8082</h1>
    <h1>Backend Server 8083</h1>
    <h1>Backend Server 8081</h1>
    <h1>Backend Server 8082</h1>
    <h1>Backend Server 8083</h1>

Server 8082 is back, man.

┌────────────────────────────────────────────────────────────────┐
│                                                                │
│   Timeline of Events:                                          │
│                                                                │
│   0:00 - All servers healthy, traffic distributed 1→2→3→1...   │
│                                                                │
│   0:05 - Server 2 dies                                         │
│                                                                │
│   0:10 - Health check fails (1/3)                              │
│                                                                │
│   0:15 - Health check fails (2/3)                              │
│                                                                │
│   0:20 - Health check fails (3/3) → Server 2 UNHEALTHY         │
│          Traffic now only goes to Server 1 and Server 3        │
│                                                                │
│   1:00 - Server 2 comes back online                            │
│                                                                │
│   1:05 - Health check succeeds (1/2)                           │
│                                                                │
│   1:10 - Health check succeeds (2/2) → Server 2 HEALTHY        │
│          Traffic distributed to all 3 servers again            │
│                                                                │
└────────────────────────────────────────────────────────────────┘

The State Machine

Our health checking logic can be visualized as a state machine:

State machine


Passive Health Checks (Bonus)

I mentioned earlier that production load balancers use both active AND passive health checks. We’ve implemented active checks. Let me quickly show you the idea behind passive checks.

The concept: if a real request to a backend fails, that’s useful health information too.

func (lb *LoadBalancer) handleConnection(clientConn net.Conn) {
    defer clientConn.Close()

    backend := lb.serverPool.GetNextBackend()
    if backend == nil {
        log.Printf("[LB] No healthy backend servers available!")
        clientConn.Write([]byte("HTTP/1.1 503 Service Unavailable\r\n\r\nNo healthy backends"))
        return
    }

    backendAddress := backend.Address()

    backendConn, err := net.Dial("tcp", backendAddress)
    if err != nil {
        log.Printf("[LB] Failed to connect to backend %s: %v", backendAddress, err)

        // PASSIVE HEALTH CHECK: mark backend as potentially unhealthy
        // in a real implementation, you'd track consecutive failures
        // and mark unhealthy after a threshold, just like active checks
        lb.recordBackendFailure(backend)

        clientConn.Write([]byte("HTTP/1.1 502 Bad Gateway\r\n\r\nBackend connection failed"))
        return
    }
    defer backendConn.Close()

    // PASSIVE HEALTH CHECK: mark backend as healthy (successful connection)
    lb.recordBackendSuccess(backend)

    lb.forwardTraffic(clientConn, backendConn)
}

The advantage of passive checks is that they detect failures in real time. But the disadvantage is that users experience those failures.

In practice, active checks are your first line of defense, and passive checks are a backup.

I won’t implement passive checks fully in this blog (it’s already long enough), but you get the idea.
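
But if you want to sketch it yourself, the two helpers could look something like this. Note that recordBackendFailure / recordBackendSuccess, plus the passiveFailures map and passiveMux mutex on LoadBalancer, are all hypothetical additions; they don’t exist in the code above:

// hypothetical additions to LoadBalancer:
//   passiveFailures map[string]int
//   passiveMux      sync.Mutex

func (lb *LoadBalancer) recordBackendFailure(backend *Backend) {
    lb.passiveMux.Lock()
    defer lb.passiveMux.Unlock()

    key := backend.Address()
    lb.passiveFailures[key]++

    // same threshold idea as the active checker, just fed by real traffic
    if backend.IsAlive() && lb.passiveFailures[key] >= 3 {
        backend.SetAlive(false)
        log.Printf("[PASSIVE] Server %s marked UNHEALTHY after real-traffic failures", key)
        lb.passiveFailures[key] = 0
    }
}

func (lb *LoadBalancer) recordBackendSuccess(backend *Backend) {
    lb.passiveMux.Lock()
    defer lb.passiveMux.Unlock()

    // one good real request resets the failure count
    lb.passiveFailures[backend.Address()] = 0
}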


What We Haven’t Covered (Yet)

Our health checking is solid, but there’s more we could do:

1. Circuit Breaker Pattern

What if a server is alive but returning errors (500s)? Or is very slow? Active TCP/HTTP checks might pass, but the server isn’t actually useful.

The circuit breaker pattern tracks real request success rates and “opens the circuit” (stops sending traffic) when error rates are too high.

I’ll cover this in a future blog when I talk about failure handling in depth.

2. Graceful Degradation

When a server becomes unhealthy, what about the requests currently being processed? They just get dropped. In production, you’d want “connection draining” to finish the current requests before removing a server.


What Now?

In the next part, I’m going to dive into different load balancing algorithms. Round Robin is cool and all, but it’s pretty simplistic. What if your servers have different capacities? What if you want to send traffic to the server with the fewest connections? What if you want to consider server response times?

There’s a whole world of algorithms beyond RR, and we’ll explore them all.


Feel free to hit me up on X / Twitter if you have questions, found bugs, or just want to chat about load balancers. Always happy to hear from people working through this series.

See you in the next part :)