The Only Guide You'd Ever Need for Load Balancers - 5
Health Checking - Don’t Send Traffic to Dead Servers
Welcome back, and sorry for taking so long, but life happens. If you’re coming from part 4, you just built a working load balancer. Actual code, actual traffic distribution, actual round-robin. Pretty cool, right?
But remember what happened when we killed one of the backend servers? 33% of our requests started failing. Our load balancer kept happily sending traffic to a dead server like nothing was wrong. That’s not just suboptimal, it’s embarrassing.
In this part, we’re going to fix that. We’re going to teach our load balancer to detect when servers die and stop sending them traffic. This is called health checking, because we…check their health? I don’t know how I could’ve worded that any better, haha.
Servers Die
Let me paint you a picture. It’s 2 AM. You’re a sleeping beauty. Your wingman dating app has three backend servers, all happily serving users looking for love.
Then Server 2 crashes. Maybe the disk filled up. Maybe a memory leak finally hit because of your vibe coded app. Doesn’t matter. Server 2 is dead.
Before crash:

After server 2 crashes:

Without health checking, your load balancer has no way of knowing which servers are alive and which are dead. It just keeps rotating through the list, sending traffic to dead servers.
Types of Server Failures
Before we implement health checking, let’s understand what can actually go wrong. Servers don’t just “die” in one way. There are multiple failure modes, and they all suck differently.
1. Complete Server Down
The server machine is completely off. Power failure, hardware failure, someone unplugged it.

How it manifests: Connection timeout (or an unreachable-host error). Nothing answers, so the TCP handshake never completes.
2. Process Crash
The server machine is fine, but the application crashed. The OS is running, but your web server process isn’t.

How it manifests: Connection refused. The port isn’t listening.
3. Network Partition
The server is fine, the application is fine, but there’s a network problem between the load balancer and the server.

How it manifests: Connection timeout. Packets are being dropped or lost.
4. Overloaded / Slow Server
This is the sneaky one. The server is technically “alive” but it’s so overwhelmed that it might as well be dead.

How it manifests: Connections succeed, but responses take forever or never come.
Why This Matters
Different failure types need different detection strategies:
| Failure Type | Detection Method |
|---|---|
| Server down | TCP connection fails |
| Process crash | TCP connection refused |
| Network partition | TCP connection timeout |
| Overloaded server | Response timeout / slow response |
A good health checking system needs to handle ALL of these.
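To make that concrete, here’s a quick sketch (my own, not part of the load balancer we build below, and assuming a Unix-like system) of how Go surfaces some of these failure modes at the TCP level:

import (
	"errors"
	"net"
	"syscall"
	"time"
)

// classifyDial is purely illustrative: it tries a TCP dial and reports which
// failure mode it looks like. Our health checker doesn't need this level of
// detail (any failure counts as a failed check), but it helps when debugging.
func classifyDial(address string) string {
	conn, err := net.DialTimeout("tcp", address, 3*time.Second)
	if err == nil {
		conn.Close()
		return "reachable" // port is open; says nothing about the app being healthy
	}
	var nerr net.Error
	if errors.As(err, &nerr) && nerr.Timeout() {
		return "timeout" // machine down, firewalled, or a network partition
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "refused" // machine is up, but nothing is listening (process crash)
	}
	return "other error"
}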
Active vs Passive Health Checks
There are two approaches to detecting server failures. Both have their place.
Passive Health Checks
“Learn from real traffic”
With passive health checks, you don’t actively probe servers. Instead, you watch what happens when you send real user traffic.
Request 1 ──► Server 2 ──► Success ✓
Request 2 ──► Server 2 ──► Success ✓
Request 3 ──► Server 2 ──► FAILED ✗
Request 4 ──► Server 2 ──► FAILED ✗
Request 5 ──► Server 2 ──► FAILED ✗
"Hmm, 3 failures in a row..."
"Maybe Server 2 is dead?"
"Let me stop sending traffic to it"
Pros:
- No extra network traffic
Cons:
- Users experience the failures (that’s how you detect them)
- Slow to detect (need multiple failures)
Active Health Checks
“Proactively probe servers”
With active health checks, you continuously send test requests to servers to check if they’re alive. Real user traffic is separate.
Health Checker (background process):
Every 5 seconds:
├── Probe Server 1 ──► Response ✓ (healthy)
├── Probe Server 2 ──► No response ✗ (unhealthy)
└── Probe Server 3 ──► Response ✓ (healthy)
Real traffic only goes to healthy servers:
User Request ──► Server 1 or Server 3 (not 2)
Pros:
- Detects failures BEFORE users hit them
- Fast detection (configurable interval)
Cons:
- Extra network traffic
Which Should We Use?
Both.
Seriously.
- Active health checks for proactive detection
- Passive health checks as a backup (if active checks miss something)
For this blog, we’ll implement active health checks first. They give us the most bang for our buck and are what most people think of when they hear “health checking.”
Active Health Check Types
There are different ways to probe a server. Let’s start from the simplest.
1. TCP Health Check (Simplest)
Just try to establish a TCP connection. If it succeeds, the server is “alive.”

What it checks:
- Server machine is running
- Port is listening
- Network path is working
What it DOESN’T check:
- Application is working correctly
- Database connections are fine
- Dependencies are healthy
When to use: When you just need basic “is the port open” checks.
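In Go, the TCP variant is only a few lines. Here’s a sketch (tcpHealthCheck isn’t part of the load balancer we build below; we’ll use HTTP checks instead):

import (
	"net"
	"time"
)

// tcpHealthCheck: try to open a TCP connection within a timeout.
// Success only proves the port is open and the network path works.
func tcpHealthCheck(address string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", address, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}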
2. HTTP Health Check (Most Common)
Send an actual HTTP request to a health endpoint. Check the response.

What it checks:
- Everything TCP check does, PLUS
- HTTP server is responding
- App code is running
- Dependencies are healthy
When to use: Most of the time.
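To make the “dependencies are healthy” part concrete: a /health handler can actually verify a dependency before returning 200. Here’s a sketch with a hypothetical *sql.DB (the backend servers later in this post keep it simpler and always return 200):

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// healthHandler reports healthy only if the database answers a ping in time.
func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			w.Write([]byte(`{"status": "unhealthy"}`))
			return
		}
		w.Write([]byte(`{"status": "healthy"}`))
	}
}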
3. Custom Protocol Health Checks
For non-HTTP services (databases, message queues, custom protocols), you might need custom health checks.
Examples:
- MySQL: Run "SELECT 1" query
- Redis: Send PING, expect PONG
- gRPC: Call Health.Check RPC
- Custom TCP: Send magic bytes, check response
When to use: When your backend isn’t HTTP based.
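As an example of that last category, here’s a sketch of a Redis check over raw TCP: send an inline PING and expect "+PONG" back. Purely illustrative (in practice you’d likely use a Redis client library, and our backends in this series speak HTTP anyway):

import (
	"bufio"
	"net"
	"strings"
	"time"
)

// redisHealthCheck sends an inline PING and checks for a +PONG reply.
func redisHealthCheck(address string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", address, timeout)
	if err != nil {
		return false
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(timeout))
	if _, err := conn.Write([]byte("PING\r\n")); err != nil {
		return false
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return false
	}
	return strings.HasPrefix(reply, "+PONG")
}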
For our load balancer, we’ll implement HTTP health checks since our backend servers are HTTP servers.
Health Check Parameters
Before we write code, let’s understand the knobs we can turn.
Interval
How often do we check server health?
Interval = 5 seconds:
Time:    0s      5s      10s     15s     20s     25s
         │       │       │       │       │       │
         ▼       ▼       ▼       ▼       ▼       ▼
       Check   Check   Check   Check   Check   Check
- Too frequent (100ms): Waste bandwidth, stress servers
- Too infrequent (60s): Slow to detect failures
- Sweet spot: Usually 5-10 seconds for most applications
Timeout
How long do we wait for a health check response before giving up?
Timeout = 3 seconds:
Health Checker                            Server
      │                                     │
      │──── Health check request ──────────►│
      │                                     │
      │          ... waiting ...            │
      │          ... still waiting ...      │
      │          ... 3 seconds pass ...     │
      │                                     │
      │   TIMEOUT - server is UNHEALTHY     │
- Too short (100ms): False positives on slow responses
- Too long (30s): Takes forever to detect failures
- Sweet spot: Usually 2-5 seconds
Threshold (Consecutive Failures)
How many consecutive failures before we mark a server as unhealthy?
Threshold = 3 consecutive failures:
Check 1: FAIL ─► Server still marked healthy (1/3 failures)
Check 2: FAIL ─► Server still marked healthy (2/3 failures)
Check 3: FAIL ─► Server marked UNHEALTHY (3/3 failures)
Why not mark unhealthy on first failure?
- Network blips happen
- Single request timeouts are normal
- We want to avoid “flapping” (don’t worry, it’s explained in the diagram below)
┌───────────────────────────────────────────────────────────────┐
│ │
│ Without threshold (mark unhealthy on first failure): │
│ │
│ Time: ──────────────────────────────────────────────► │
│ │
│ Status: HEALTHY ─► UNHEALTHY ─► HEALTHY ─► UNHEALTHY ─► │
│ ↑ ↑ ↑ ↑ │
│ 1 fail 1 success 1 fail 1 success │
│ │
│ This is called "flapping" - constant status changes │
│ Very bad for routing stability │
│ │
├───────────────────────────────────────────────────────────────┤
│ │
│ With threshold = 3: │
│ │
│ Time: ──────────────────────────────────────────────► │
│ │
│ Status: HEALTHY ────────────────────► UNHEALTHY ──────► │
│ ↑ │
│ 3 consecutive failures │
│ │
│ Much more stable │
│ │
└───────────────────────────────────────────────────────────────┘
Rise Count
How many consecutive successes before we mark a server as healthy again?
Rise = 2 consecutive successes:
Server was unhealthy, now recovering:
Check 1: SUCCESS ─► Still unhealthy (1/2 successes)
Check 2: SUCCESS ─► Marked HEALTHY again (2/2 successes)
This prevents a server that’s still flaky mid-recovery from immediately getting traffic again.
Our Configuration
For our implementation, we’ll use:
| Parameter | Value | Why |
|---|---|---|
| Interval | 5 seconds | Good balance |
| Timeout | 3 seconds | Long enough for slow responses |
| Unhealthy Threshold | 3 failures | Avoid flapping |
| Healthy Threshold | 2 successes | Quick recovery |
Implementation Time
Alright, enough theory. Let’s write the code.
Updating Our Backend Structure
First, we need to track health status for each backend:
type Backend struct {
Host string
Port int
Alive bool // is this server healthy?
mux sync.RWMutex // for thread safe status updates
}
// getter
func (b *Backend) IsAlive() bool {
b.mux.RLock()
defer b.mux.RUnlock()
return b.Alive
}
// setter
func (b *Backend) SetAlive(alive bool) {
b.mux.Lock()
defer b.mux.Unlock()
b.Alive = alive
}
Notice we’re using sync.RWMutex instead of sync.Mutex. This allows multiple goroutines to read simultaneously (using RLock), while a write still takes an exclusive lock. Since we read the Alive status far more often than we write it, RWMutex is the better fit here.
The Health Checker Structure
type HealthChecker struct {
pool *ServerPool
checkInterval time.Duration
timeout time.Duration
unhealthyThreshold int
healthyThreshold int
healthPath string
// track failures/successes per backend
failureCounts map[string]int
successCounts map[string]int
countsMux sync.Mutex
}
func NewHealthChecker(pool *ServerPool) *HealthChecker {
return &HealthChecker{
pool: pool,
checkInterval: 5 * time.Second,
timeout: 3 * time.Second,
unhealthyThreshold: 3,
healthyThreshold: 2,
healthPath: "/health",
failureCounts: make(map[string]int),
successCounts: make(map[string]int),
}
}
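If you’d rather not hard-code these knobs, a second constructor could take them as parameters. A sketch (NewHealthCheckerWithConfig is my own name; the rest of this post sticks with the defaults above):

func NewHealthCheckerWithConfig(pool *ServerPool, interval, timeout time.Duration,
	unhealthyThreshold, healthyThreshold int, healthPath string) *HealthChecker {
	return &HealthChecker{
		pool:               pool,
		checkInterval:      interval,
		timeout:            timeout,
		unhealthyThreshold: unhealthyThreshold,
		healthyThreshold:   healthyThreshold,
		healthPath:         healthPath,
		failureCounts:      make(map[string]int),
		successCounts:      make(map[string]int),
	}
}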
The Core Health Check Logic
Here’s the function that actually checks if a single server is healthy:
func (hc *HealthChecker) checkBackend(backend *Backend) bool {
url := fmt.Sprintf("http://%s:%d%s", backend.Host, backend.Port, hc.healthPath)
// create HTTP client with our timeout
client := &http.Client{
Timeout: hc.timeout,
}
resp, err := client.Get(url)
if err != nil {
// connection failed, timeout, or other error
return false
}
defer resp.Body.Close()
// we consider 2xx status codes as healthy
return resp.StatusCode >= 200 && resp.StatusCode < 300
}
Simple, right? We make an HTTP GET request to /health and check if we get a 2xx response.
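One optional tweak while we’re here: Go’s HTTP client can only reuse a keep-alive connection if the response body has been fully read, so you could drain it before closing. A sketch of that variant (checkBackendDraining is my name for it; the version above is fine for this post):

// Same as checkBackend, but drains the body so the connection can be reused.
func (hc *HealthChecker) checkBackendDraining(backend *Backend) bool {
	url := fmt.Sprintf("http://%s:%d%s", backend.Host, backend.Port, hc.healthPath)
	client := &http.Client{Timeout: hc.timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the keep-alive connection is reusable
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}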
Handling Consecutive Failures/Successes
Now the logic that tracks consecutive failures and updates server status:
func (hc *HealthChecker) processHealthCheckResult(backend *Backend, isHealthy bool) {
hc.countsMux.Lock()
defer hc.countsMux.Unlock()
key := fmt.Sprintf("%s:%d", backend.Host, backend.Port)
wasAlive := backend.IsAlive()
if isHealthy {
hc.failureCounts[key] = 0
hc.successCounts[key]++
// if server was down and we've hit the healthy threshold, bring it back
if !wasAlive && hc.successCounts[key] >= hc.healthyThreshold {
backend.SetAlive(true)
log.Printf("[HEALTH] Server %s is now HEALTHY (after %d successful checks)",
key, hc.successCounts[key])
hc.successCounts[key] = 0
}
} else {
hc.successCounts[key] = 0
hc.failureCounts[key]++
// if server was up and we've hit the unhealthy threshold, mark it down
if wasAlive && hc.failureCounts[key] >= hc.unhealthyThreshold {
backend.SetAlive(false)
log.Printf("[HEALTH] Server %s is now UNHEALTHY (after %d failed checks)",
key, hc.failureCounts[key])
hc.failureCounts[key] = 0
} else if wasAlive {
log.Printf("[HEALTH] Server %s failed check (%d/%d)",
key, hc.failureCounts[key], hc.unhealthyThreshold)
}
}
}
This is basically what’s happening:
Server starts HEALTHY:

Check 1: FAIL
    failureCounts["server1"] = 1
    Log: "Server server1 failed check (1/3)"
    Status: still HEALTHY

Check 2: FAIL
    failureCounts["server1"] = 2
    Log: "Server server1 failed check (2/3)"
    Status: still HEALTHY

Check 3: FAIL
    failureCounts["server1"] = 3
    Log: "Server server1 is now UNHEALTHY"
    Status: UNHEALTHY ← now we stop sending traffic

Check 4: SUCCESS
    successCounts["server1"] = 1
    failureCounts["server1"] = 0 (reset)
    Status: still UNHEALTHY

Check 5: SUCCESS
    successCounts["server1"] = 2
    Log: "Server server1 is now HEALTHY"
    Status: HEALTHY ← back in rotation :)
The Background Health Check Loop
This runs continuously, checking all servers at the configured interval:
func (hc *HealthChecker) Start() {
log.Printf("[HEALTH] Starting health checker (interval: %v, timeout: %v)",
hc.checkInterval, hc.timeout)
ticker := time.NewTicker(hc.checkInterval)
defer ticker.Stop()
// initial check immediately
hc.checkAllBackends()
for range ticker.C {
hc.checkAllBackends()
}
}
func (hc *HealthChecker) checkAllBackends() {
	backends := hc.pool.GetAllBackends()
	var wg sync.WaitGroup
	for _, backend := range backends {
		wg.Add(1)
		go func(b *Backend) {
			defer wg.Done()
			isHealthy := hc.checkBackend(b)
			hc.processHealthCheckResult(b, isHealthy)
		}(backend)
	}
	wg.Wait()
}
We check all backends in parallel using goroutines, then wait for the whole round to finish with a sync.WaitGroup. No point in checking servers one at a time and letting a single slow timeout hold up the rest, and waiting at the end keeps one round from overlapping with the next.

Updating the Server Pool
We need to update GetNextBackend to skip unhealthy servers:
func (p *ServerPool) GetNextBackend() *Backend {
p.mux.Lock()
defer p.mux.Unlock()
if len(p.backends) == 0 {
return nil
}
// try to find a healthy backend
// we'll try at most len(backends) times to avoid infinite loop
for i := 0; i < len(p.backends); i++ {
backend := p.backends[p.current]
p.current = (p.current + 1) % len(p.backends)
if backend.IsAlive() {
return backend
}
}
return nil
}
// need this for health checker
func (p *ServerPool) GetAllBackends() []*Backend {
p.mux.RLock()
defer p.mux.RUnlock()
result := make([]*Backend, len(p.backends))
for i := range p.backends {
result[i] = p.backends[i]
}
return result
}
The key change: we now skip backends where IsAlive() returns false.
Before (no health checking):
Backends: [Server1, Server2(dead), Server3]
Current index: 0
GetNextBackend() → Server1, index becomes 1
GetNextBackend() → Server2 (DEAD! User gets error), index becomes 2
GetNextBackend() → Server3, index becomes 0
GetNextBackend() → Server1, index becomes 1
...
After (with health checking):
Backends: [Server1(alive), Server2(dead), Server3(alive)]
Current index: 0
GetNextBackend() → Server1 (alive), return it, index becomes 1
GetNextBackend() → Server2 (dead, skip), Server3 (alive), return it, index becomes 0
GetNextBackend() → Server1 (alive), return it, index becomes 1
...
Dead servers are automatically skipped
Updating AddBackend
New servers should start as healthy (assume they’re good until something bad happens, innocent until proven guilty type of thing):
func (p *ServerPool) AddBackend(host string, port int) {
	p.mux.Lock()
	defer p.mux.Unlock()
	backend := &Backend{
		Host:  host,
		Port:  port,
		Alive: true, // start as healthy
	}
	p.backends = append(p.backends, backend)
	log.Printf("[POOL] Added server: %s:%d", host, port)
}
The Complete Updated Code
Here’s our full main.go with health checking:
package main
import (
"fmt"
"io"
"log"
"net"
"net/http"
"sync"
"time"
)
type Backend struct {
Host string
Port int
Alive bool
mux sync.RWMutex
}
func (b *Backend) IsAlive() bool {
b.mux.RLock()
defer b.mux.RUnlock()
return b.Alive
}
func (b *Backend) SetAlive(alive bool) {
b.mux.Lock()
defer b.mux.Unlock()
b.Alive = alive
}
func (b *Backend) Address() string {
return fmt.Sprintf("%s:%d", b.Host, b.Port)
}
type ServerPool struct {
backends []*Backend
current int
mux sync.RWMutex
}
func NewServerPool() *ServerPool {
return &ServerPool{
backends: make([]*Backend, 0),
current: 0,
}
}
func (p *ServerPool) AddBackend(host string, port int) {
p.mux.Lock()
defer p.mux.Unlock()
backend := &Backend{
Host: host,
Port: port,
Alive: true,
}
p.backends = append(p.backends, backend)
log.Printf("[POOL] Added server: %s:%d", host, port)
}
func (p *ServerPool) GetNextBackend() *Backend {
p.mux.Lock()
defer p.mux.Unlock()
if len(p.backends) == 0 {
return nil
}
for i := 0; i < len(p.backends); i++ {
backend := p.backends[p.current]
p.current = (p.current + 1) % len(p.backends)
if backend.IsAlive() {
return backend
}
}
return nil
}
func (p *ServerPool) GetAllBackends() []*Backend {
p.mux.RLock()
defer p.mux.RUnlock()
result := make([]*Backend, len(p.backends))
copy(result, p.backends)
return result
}
func (p *ServerPool) Size() int {
p.mux.RLock()
defer p.mux.RUnlock()
return len(p.backends)
}
func (p *ServerPool) HealthyCount() int {
p.mux.RLock()
defer p.mux.RUnlock()
count := 0
for _, b := range p.backends {
if b.IsAlive() {
count++
}
}
return count
}
type HealthChecker struct {
pool *ServerPool
checkInterval time.Duration
timeout time.Duration
unhealthyThreshold int
healthyThreshold int
healthPath string
failureCounts map[string]int
successCounts map[string]int
countsMux sync.Mutex
}
func NewHealthChecker(pool *ServerPool) *HealthChecker {
return &HealthChecker{
pool: pool,
checkInterval: 5 * time.Second,
timeout: 3 * time.Second,
unhealthyThreshold: 3,
healthyThreshold: 2,
healthPath: "/health",
failureCounts: make(map[string]int),
successCounts: make(map[string]int),
}
}
func (hc *HealthChecker) checkBackend(backend *Backend) bool {
url := fmt.Sprintf("http://%s:%d%s", backend.Host, backend.Port, hc.healthPath)
client := &http.Client{
Timeout: hc.timeout,
}
resp, err := client.Get(url)
if err != nil {
return false
}
defer resp.Body.Close()
return resp.StatusCode >= 200 && resp.StatusCode < 300
}
func (hc *HealthChecker) processHealthCheckResult(backend *Backend, isHealthy bool) {
hc.countsMux.Lock()
defer hc.countsMux.Unlock()
key := backend.Address()
wasAlive := backend.IsAlive()
if isHealthy {
hc.failureCounts[key] = 0
hc.successCounts[key]++
if !wasAlive && hc.successCounts[key] >= hc.healthyThreshold {
backend.SetAlive(true)
log.Printf("[HEALTH] Server %s is now HEALTHY", key)
hc.successCounts[key] = 0
}
} else {
hc.successCounts[key] = 0
hc.failureCounts[key]++
if wasAlive && hc.failureCounts[key] >= hc.unhealthyThreshold {
backend.SetAlive(false)
log.Printf("[HEALTH] Server %s is now UNHEALTHY", key)
hc.failureCounts[key] = 0
} else if wasAlive {
log.Printf("[HEALTH] Server %s failed check (%d/%d)",
key, hc.failureCounts[key], hc.unhealthyThreshold)
}
}
}
func (hc *HealthChecker) checkAllBackends() {
backends := hc.pool.GetAllBackends()
var wg sync.WaitGroup
for _, backend := range backends {
wg.Add(1)
go func(b *Backend) {
defer wg.Done()
isHealthy := hc.checkBackend(b)
hc.processHealthCheckResult(b, isHealthy)
}(backend)
}
wg.Wait()
}
func (hc *HealthChecker) Start() {
log.Printf("[HEALTH] Starting health checker (interval: %v, timeout: %v)",
hc.checkInterval, hc.timeout)
// initial check
hc.checkAllBackends()
ticker := time.NewTicker(hc.checkInterval)
for range ticker.C {
hc.checkAllBackends()
}
}
type LoadBalancer struct {
host string
port int
serverPool *ServerPool
}
func NewLoadBalancer(host string, port int, pool *ServerPool) *LoadBalancer {
return &LoadBalancer{
host: host,
port: port,
serverPool: pool,
}
}
func (lb *LoadBalancer) Start() error {
address := fmt.Sprintf("%s:%d", lb.host, lb.port)
listener, err := net.Listen("tcp", address)
if err != nil {
return fmt.Errorf("failed to start listener: %v", err)
}
defer listener.Close()
log.Printf("[LB] Load Balancer started on %s", address)
log.Printf("[LB] Backend servers: %d", lb.serverPool.Size())
for {
conn, err := listener.Accept()
if err != nil {
log.Printf("[LB] Failed to accept connection: %v", err)
continue
}
go lb.handleConnection(conn)
}
}
func (lb *LoadBalancer) handleConnection(clientConn net.Conn) {
defer clientConn.Close()
backend := lb.serverPool.GetNextBackend()
if backend == nil {
log.Printf("[LB] No healthy backend servers available!")
clientConn.Write([]byte("HTTP/1.1 503 Service Unavailable\r\n\r\nNo healthy backends"))
return
}
backendAddress := backend.Address()
log.Printf("[LB] Forwarding %s → %s", clientConn.RemoteAddr(), backendAddress)
backendConn, err := net.Dial("tcp", backendAddress)
if err != nil {
log.Printf("[LB] Failed to connect to backend %s: %v", backendAddress, err)
clientConn.Write([]byte("HTTP/1.1 502 Bad Gateway\r\n\r\nBackend connection failed"))
return
}
defer backendConn.Close()
lb.forwardTraffic(clientConn, backendConn)
}
func (lb *LoadBalancer) forwardTraffic(client, backend net.Conn) {
var wg sync.WaitGroup
wg.Add(2)
go func() {
defer wg.Done()
io.Copy(backend, client)
}()
go func() {
defer wg.Done()
io.Copy(client, backend)
}()
wg.Wait()
}
func main() {
pool := NewServerPool()
pool.AddBackend("127.0.0.1", 8081)
pool.AddBackend("127.0.0.1", 8082)
pool.AddBackend("127.0.0.1", 8083)
// start health checker in background
healthChecker := NewHealthChecker(pool)
go healthChecker.Start()
lb := NewLoadBalancer("0.0.0.0", 8080, pool)
if err := lb.Start(); err != nil {
log.Fatalf("Load balancer failed: %v", err)
}
}
Updating the Backend Server
Our backend servers need a /health endpoint. Let’s update backend/server.go:
package main
import (
"fmt"
"log"
"net/http"
"os"
"strconv"
)
func main() {
if len(os.Args) != 2 {
fmt.Println("Usage: go run server.go <port>")
os.Exit(1)
}
port, err := strconv.Atoi(os.Args[1])
if err != nil {
log.Fatalf("Invalid port: %v", err)
}
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status": "healthy"}`))
})
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
log.Printf("[SERVER %d] Handled request for %s", port, r.URL.Path)
response := fmt.Sprintf(`
<!DOCTYPE html>
<html>
<head>
<title>Backend Server %d</title>
</head>
<body>
<h1>Backend Server %d</h1>
<p>Request was handled by server on port %d</p>
<p>Path: %s</p>
<p>Method: %s</p>
</body>
</html>
`, port, port, port, r.URL.Path, r.Method)
fmt.Fprint(w, response)
})
address := fmt.Sprintf(":%d", port)
log.Printf("[SERVER] Backend server started on port %d", port)
if err := http.ListenAndServe(address, nil); err != nil {
log.Fatalf("Server failed: %v", err)
}
}
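One optional extra before we test: if you’d rather not kill processes to simulate failures, you could make /health toggleable. This is just a sketch (the healthy flag and the /toggle endpoint are my own additions, and the steps below stick with Ctrl+C); it would replace the /health handler inside main() and needs "sync/atomic" in the imports:

// toggleable health: flip the flag at runtime to simulate a failing server
var healthy atomic.Bool
healthy.Store(true)

http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
	if !healthy.Load() {
		// a non-2xx response counts as a failed check on the load balancer side
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte(`{"status": "healthy"}`))
})

// flip it with: curl http://localhost:<port>/toggle
http.HandleFunc("/toggle", func(w http.ResponseWriter, r *http.Request) {
	healthy.Store(!healthy.Load())
	fmt.Fprintf(w, "healthy=%v\n", healthy.Load())
})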
Testing Health Checking
Now let’s see our health checking in action.
Step 1: Start Backend Servers
# terminal 1
go run backend/server.go 8081
# terminal 2
go run backend/server.go 8082
# terminal 3
go run backend/server.go 8083
Step 2: Start the Load Balancer
# terminal 4
go run main.go
You should see:
[POOL] Added server: 127.0.0.1:8081
[POOL] Added server: 127.0.0.1:8082
[POOL] Added server: 127.0.0.1:8083
[HEALTH] Starting health checker (interval: 5s, timeout: 3s)
[LB] Load Balancer started on 0.0.0.0:8080
[LB] Backend servers: 3
Step 3: Verify Normal Operation
Make some requests:
for i in {1..6}; do curl -s http://localhost:8080 | grep "Backend Server"; done
Should see traffic distributed across all three servers.
Step 4: Kill a Server
Now the not-so-boring part. Kill server 8082:
# terminal 2
^C
Watch the load balancer logs:
[HEALTH] Server 127.0.0.1:8082 failed check (1/3)
[HEALTH] Server 127.0.0.1:8082 failed check (2/3)
[HEALTH] Server 127.0.0.1:8082 is now UNHEALTHY
After 3 failed checks, Server 8082 is marked unhealthy.
Step 5: Verify Traffic Routing
Now make more requests:
for i in {1..6}; do curl -s http://localhost:8080 | grep "Backend Server"; done
Output:
<h1>Backend Server 8081</h1>
<h1>Backend Server 8083</h1>
<h1>Backend Server 8081</h1>
<h1>Backend Server 8083</h1>
<h1>Backend Server 8081</h1>
<h1>Backend Server 8083</h1>
No more 8082!! The dead server is excluded from rotation.
Step 6: Bring the Server Back
Restart server 8082:
# terminal 2
go run backend/server.go 8082
Watch the load balancer logs:
[HEALTH] Server 127.0.0.1:8082 is now HEALTHY
After 2 successful health checks (about 10 seconds), Server 8082 is back in rotation.
Step 7: Verify Recovery
for i in {1..6}; do curl -s http://localhost:8080 | grep "Backend Server"; done
Output:
<h1>Backend Server 8081</h1>
<h1>Backend Server 8082</h1>
<h1>Backend Server 8083</h1>
<h1>Backend Server 8081</h1>
<h1>Backend Server 8082</h1>
<h1>Backend Server 8083</h1>
Server 8082 is back, man.
┌────────────────────────────────────────────────────────────────┐
│ │
│ Timeline of Events: │
│ │
│ 0:00 - All servers healthy, traffic distributed 1→2→3→1... │
│ │
│ 0:05 - Server 2 dies │
│ │
│ 0:10 - Health check fails (1/3) │
│ │
│ 0:15 - Health check fails (2/3) │
│ │
│ 0:20 - Health check fails (3/3) → Server 2 UNHEALTHY │
│ Traffic now only goes to Server 1 and Server 3 │
│ │
│ 1:00 - Server 2 comes back online │
│ │
│ 1:05 - Health check succeeds (1/2) │
│ │
│ 1:10 - Health check succeeds (2/2) → Server 2 HEALTHY │
│ Traffic distributed to all 3 servers again │
│ │
└────────────────────────────────────────────────────────────────┘
The State Machine
Our health checking logic can be visualized as a state machine:

Passive Health Checks (Bonus)
I mentioned earlier that production load balancers use both active AND passive health checks. We’ve implemented active checks. Let me quickly show you the idea behind passive checks.
The concept: if a real request to a backend fails, that’s useful health information too.
func (lb *LoadBalancer) handleConnection(clientConn net.Conn) {
defer clientConn.Close()
backend := lb.serverPool.GetNextBackend()
if backend == nil {
log.Printf("[LB] No healthy backend servers available!")
clientConn.Write([]byte("HTTP/1.1 503 Service Unavailable\r\n\r\nNo healthy backends"))
return
}
backendAddress := backend.Address()
backendConn, err := net.Dial("tcp", backendAddress)
if err != nil {
log.Printf("[LB] Failed to connect to backend %s: %v", backendAddress, err)
// PASSIVE HEALTH CHECK: mark backend as potentially unhealthy
// in a real implementation, you'd track consecutive failures
// and mark unhealthy after a threshold, just like active checks
lb.recordBackendFailure(backend)
clientConn.Write([]byte("HTTP/1.1 502 Bad Gateway\r\n\r\nBackend connection failed"))
return
}
defer backendConn.Close()
// PASSIVE HEALTH CHECK: mark backend as healthy (successful connection)
lb.recordBackendSuccess(backend)
lb.forwardTraffic(clientConn, backendConn)
}
The advantage of passive checks is that they detect failures in real time. But the disadvantage is that users experience those failures.
In practice, active checks are your first line of defense, and passive checks are a backup.
I won’t implement passive checks fully in this blog (it’s already long enough), but you get the idea.
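That said, here’s a minimal sketch of what those two helpers could look like, reusing the same threshold idea as the active checker. The passiveFailures map and passiveMux mutex are hypothetical fields you’d add to LoadBalancer; none of this is in the final code:

// Sketch only: assumes LoadBalancer gains two extra fields,
// passiveFailures map[string]int and passiveMux sync.Mutex.
func (lb *LoadBalancer) recordBackendFailure(backend *Backend) {
	lb.passiveMux.Lock()
	defer lb.passiveMux.Unlock()
	key := backend.Address()
	lb.passiveFailures[key]++
	// after 3 consecutive real-traffic failures, pull the backend out of rotation;
	// the active health checker will bring it back once /health recovers
	if lb.passiveFailures[key] >= 3 && backend.IsAlive() {
		backend.SetAlive(false)
		log.Printf("[PASSIVE] Server %s marked UNHEALTHY after real request failures", key)
	}
}

func (lb *LoadBalancer) recordBackendSuccess(backend *Backend) {
	lb.passiveMux.Lock()
	defer lb.passiveMux.Unlock()
	lb.passiveFailures[backend.Address()] = 0 // any success resets the streak
}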
What We Haven’t Covered (Yet)
Our health checking is solid, but there’s more we could do:
1. Circuit Breaker Pattern
What if a server is alive but returning errors (500s)? Or is very slow? Active TCP/HTTP checks might pass, but the server isn’t actually useful.
The circuit breaker pattern tracks real request success rates and “opens the circuit” (stops sending traffic) when error rates are too high.
I’ll cover this in a future blog when I talk about failure handling in depth.
2. Graceful Degradation
When a server becomes unhealthy, what about the requests currently being processed? They just get dropped. In production, you’d want “connection draining” to finish the current requests before removing a server.
What Now?
In the next part, I’m going to dive into different load balancing algorithms. Round Robin is cool and all, but it’s not always the right fit. What if your servers have different capacities? What if you want to send traffic to the server with the fewest connections? What if you want to consider server response times?
There’s a whole world of algorithms beyond RR, and we’ll explore them all.
Feel free to hit me up on X / Twitter if you have questions, found bugs, or just want to chat about load balancers. Always happy to hear from people working through this series.
See you in the next part :)