
The Only Guide You'd Ever Need for Load Balancers - 1

The Problem We Have

If you’re building anything that you expect more than your mom to use, you’re going to run into problems at some point. And if you think “my app won’t get that popular,” I mean, it might be true ’cause there’s no way your goated app that turns your screensaver into that kiss video of Sabrina Carpenter can ever get popular (a lie, this app has insane potential), but there’s no harm in knowing everything. And in being prepared for everything. By the end of this series, you’ll (hopefully) understand not just how load balancers work, but why they’re built the way they are, and how to implement them yourself. Let’s start with a scenario you can actually relate to.


Your Average Web App

Imagine you’re the regular wingman for a couple of friends. You know other people who do the same. What do you do? You come up with a “genius” idea of creating a dating app for these “wingman people” so that they can get out of their misery. In the beginning, it’s just you and your friend (lie, you’ve no friends) using it. Everything seems to be working. They can connect with you, send you messages, yada yada. It grows a bit, there are more people in the same boat than you expected. Now 20-30 people use it, and it works. You’re happy. Your lil server just works. Life is good.

Basic client server architecture

Now, what’s actually happening under the hood (your server)?


Server Resources

Every time someone makes a request to your server, your server uses resources to process that request. I’d say there are 4 of these “resources” that you should worry about:

  1. CPU: I know y’all are questioning me RN for even covering this, but I did promise “from ground up”, didn’t I? This is the bread to your server’s butter. It does all the processing and calculations, basically all the “thinking” work.
  2. Memory: The short term memory. Much faster than normal storage. Imagine your CPU regularly uses something. It wouldn’t keep going back to your disk just to fetch it over and over, right? That’s too slow. So frequently used stuff gets stored in RAM instead. Its capacity is almost always lower than your main disk space though, so if RAM ever runs out, the OS falls back to “swap” space on disk. There’s a whole lot more I could explain here, but that’s out of the scope of this series TBH.
  3. Network Bandwidth: The pipe that connects your server to the net. Every image, JS slop, API response travels through this pipe.
  4. Disk I/O: How fast your server can read & write.
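Want to actually see these four numbers on your own machine? Here’s a quick sketch using the third-party psutil library (an assumption on my part: you’ll need a pip install psutil first):

import psutil

# 1. CPU: how much "thinking" capacity is in use right now
print(f"CPU:     {psutil.cpu_percent(interval=1)}%")

# 2. Memory: how full the short term memory is
mem = psutil.virtual_memory()
print(f"Memory:  {mem.percent}% of {mem.total // 2**30}GB used")

# 3. Network bandwidth: total bytes through the pipe since boot
net = psutil.net_io_counters()
print(f"Network: {net.bytes_sent} B sent, {net.bytes_recv} B received")

# 4. Disk I/O: total bytes read & written since boot
disk = psutil.disk_io_counters()
print(f"Disk:    {disk.read_bytes} B read, {disk.write_bytes} B written")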

Metrics That Matter

Now that you know all these terms and what a basic server looks like, how do you know your server is “healthy”? You monitor these three metrics:

Response Time

How long it takes for your server to respond to a request. Includes:

  • time to execute the code
  • time to query the DB
  • time to send the response back

Try to keep your response time under 200ms, and everyone will love your unlovable ass.
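Measuring it from the client side is dead simple. A minimal sketch (localhost:8000 is a hypothetical stand-in for wherever your server actually runs):

import time
import urllib.request

URL = "http://localhost:8000/"  # hypothetical: point this at your own server

start = time.perf_counter()
with urllib.request.urlopen(URL) as resp:
    resp.read()  # covers code execution + DB queries + sending the response back
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"response time: {elapsed_ms:.0f}ms")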

Throughput

Number of requests/queries/hits your server can handle per second (aka “RPS”). For example, a throughput of 100 means your server can handle 100 RPS.
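A crude way to get a feel for your own throughput: fire a bunch of requests concurrently and divide by the time taken. (The URL is again a hypothetical placeholder, and real load-testing tools do this far better — this is just to show the idea.)

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/"  # hypothetical: your server here
N = 200                         # total requests to fire

def hit(_):
    with urllib.request.urlopen(URL) as resp:
        resp.read()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(hit, range(N)))
elapsed = time.perf_counter() - start

print(f"{N} requests in {elapsed:.2f}s -> {N / elapsed:.1f} RPS")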

Concurrent Users

Pretty self explanatory, but really important as well. How many users are on your product at the same time has (or should have) a pretty big impact on your (future) decisions.

For example:
- 50 concurrent users
- Each user makes 5 requests per minute on average
- That's 250 requests per minute = ~4 requests per second

- 500 concurrent users
- Same 5 requests per minute
- That's 2,500 requests per minute = ~42 requests per second
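That back-of-the-envelope math is worth turning into a two-line helper, since you’ll redo it constantly:

def rps(concurrent_users, requests_per_minute_each):
    return concurrent_users * requests_per_minute_each / 60

print(rps(50, 5))   # 4.166... -> ~4 RPS
print(rps(500, 5))  # 41.66... -> ~42 RPS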

Now, let’s talk about what happens when these numbers start climbing.


When Traffic Increases

Now comes 14th Feb. You, a wingman, are doing horribly today. Every couple is enjoying themselves. You and your users aren’t. You’re all way more desperate today. Your lil product shoots up in numbers: you go from 50 visitors a day to 2,000 visitors an hour. In one day.

Here’s what happens in real-time:

Stage 1: Everything Seems Fine (For Now)

Your server is handling 100 concurrent users. Response times are still decent, maybe 150ms average. CPU is at 40%, memory at 50%. We’re good, the server can handle this, right? Wrong.

Stage 2: The Warning Signs

200 concurrent users. Response times creep up to 400ms. Some requests are taking 800ms. CPU hits 70%. Your database queries are slower because there are more of them happening now.

Users start noticing. Pages load like they were made by GPT-2. But it’s still technically working. Right? Wrong again.

Stage 3: The Failure

300 concurrent users. CPU hits 95%. Response times are now 2-3 seconds, some timing out completely. Here’s where things get really bad:

The Death Spiral:

  1. Slow responses -> requests take longer to complete
  2. Longer requests -> connections stay open longer
  3. More open connections -> more memory usage
  4. More queued requests -> more CPU usage trying to handle them
  5. High CPU usage -> everything slower
  6. Slower responses -> users refresh the page
  7. Refreshing -> even more requests
  8. More requests -> everything even slower

The server slowly failing as requests increase

Result: server effectively down. Users see error pages.
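If you want to watch the death spiral happen in numbers, here’s a toy model. The constants are completely made up (this is not a real queueing simulation), but the feedback loop is the real mechanism:

arrival_rate = 150   # new requests per tick
base_capacity = 120  # requests/tick the server can finish when healthy

backlog = 0
for tick in range(6):
    backlog += arrival_rate
    # the deeper the backlog, the more CPU is wasted just juggling it
    served = min(backlog, max(10, base_capacity - backlog // 10))
    backlog -= served
    backlog += backlog // 5  # frustrated users hit refresh, adding even more load
    print(f"tick {tick}: served {served}, backlog now {backlog}")

Run it and the backlog keeps ballooning while the amount of useful work per tick shrinks. That’s stages 3 and 4 in a dozen lines.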

Stage 4: Complete Failure

Your server stops responding to new requests entirely. Existing connections time out. Users see “503 Service Unavailable” or “Connection Timeout” errors. Your database locks up because too many queries are fighting for resources.

You’re officially down.

Even after traffic drops, your server might not recover automatically. Sometimes you have to restart everything manually, so yay even more downtime.


The First Instinct: “Let’s Upgrade the Server”

Your immediate thought is probably: “I’ll just get a bigger server, ez”

This is called vertical scaling (or “scaling up”), and it means upgrading your existing server’s hardware:

  • More CPU cores
  • More RAM
  • Faster disks
  • More network bandwidth

This actually works great… for a while.

The Vertical Scaling Journey

Let’s say you started with a basic server that costs $20/month and handles 100 concurrent users. You upgrade:

Vertical Scaling Progression:

$20/month    → 100 concurrent users   → 2 CPU, 4GB RAM
$50/month    → 300 concurrent users   → 4 CPU, 8GB RAM
$150/month   → 1,000 concurrent users → 8 CPU, 16GB RAM
$500/month   → 3,000 concurrent users → 16 CPU, 32GB RAM
$2,000/month → 8,000 concurrent users → 32 CPU, 64GB RAM

Notice the pattern? Cost climbs way faster than capacity. The last jump quadruples your bill ($500 → $2,000) for less than 3x the users. It’s not even close to a linear trade. And as the quick math below shows, the cost per user actually starts rising at the top end.
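Quick sanity check on the table above:

# cost per concurrent user at each tier from the table above
tiers = [(20, 100), (50, 300), (150, 1_000), (500, 3_000), (2_000, 8_000)]

for cost, users in tiers:
    print(f"${cost}/month for {users} users -> ${cost / users:.3f} per user")

# $20/month for 100 users -> $0.200 per user
# $50/month for 300 users -> $0.167 per user
# $150/month for 1000 users -> $0.150 per user
# $500/month for 3000 users -> $0.167 per user
# $2000/month for 8000 users -> $0.250 per user

Economies of scale at first, then the curve turns around and punishes you.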

Problems with Vertical Scaling

The Cost Curve

As you scale up, the price:performance ratio gets worse. Moving from 2 cores to 4 cores might double your cost. But moving from 16 cores to 32 cores might triple or quadruple your cost for less than 2x the performance.

And you end up paying $10,000/month for a server that can handle 20,000 concurrent users. But what if you need to handle 50,000? The math stops making sense.

Physical Limits

There’s a ceiling to how big a single machine can get. You can’t just keep adding infinite CPU cores or infinite RAM. At some point, you hit the physical limits of current technology.

The biggest servers money can buy might have 128 cores and 1TB of RAM. That sounds like a lot, but if you’re Netflix or YT, that’s a rounding error. You’d need horizontal scaling anyway.

Single Point of Failure

This is the biggest problem, and it’s what should scare you the most.

No matter how powerful your server is, it’s still one server.

If that server:

  • Has a hardware failure
  • Needs maintenance or updates
  • Gets compromised by an attack
  • Has a software bug that crashes it
  • Loses power
  • Has a network issue

Your entire application goes down. Everything. 100% unavailable.

Single server -> single point of failure

If this server fails → Everyone loses access

There’s no backup, no safety net. All your eggs are in one basket, and that basket WILL fail at some point.


So…What If We Had Multiple Servers?

*yoda’s voice* Now off goes, the lightbulb. Instead of making one server bigger and bigger, what if we had multiple smaller servers working together? This is called horizontal scaling (or “scaling out”).

  • Three $50/month servers might give you better total performance than one $150/month server, and definitely better than one $500/month server.
  • If one server dies, the others keep running. Your site stays up (maybe just slower).
  • Need more capacity? Add another server. Valentine’s is over now? Remove some servers. You can scale up and down as needed.
  • Need to update a server? Do it one at a time while the others handle traffic. Zero downtime.

Multiple servers setup

Notice those red question marks in the middle? This is what we’re about to focus on.


The New Problem

How do clients know which server to connect to?

Your website is www.forwingmen.com. When a user types that into their browser, how does it know which server to connect to? More specifically, which server’s IP address should it use?

Let’s say you have three servers:

  • Server 1: 192.168.1.10
  • Server 2: 192.168.1.11
  • Server 3: 192.168.1.12

When 1,000 users visit your site simultaneously, you need to somehow distribute them across these three servers. But how?
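Here’s the crux of it, by the way. The first thing a browser does is a DNS lookup, which hands back an address to connect to. A tiny demo (example.com stands in below, since www.forwingmen.com sadly isn’t registered):

import socket

# what the browser effectively does first: resolve the name to IP addresses
hostname, aliases, ips = socket.gethostbyname_ex("example.com")
print(ips)  # a list of IPs -- the browser just picks one and connects to it

Nothing in that flow knows about your three servers, their load, or whether they’re even alive.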

Let’s brainstorm ideas together.

The Naive Approaches (That Don’t Work)

Idea 1: “Users can just pick a server bro”

Yeah, no. Users have no idea you even have multiple servers, nor should they. They just want to visit www.forwingmen.com.

Idea 2: “I’ll tell users the different IP addresses”

So your homepage says “try 192.168.1.10, or 192.168.1.11 if that doesn’t work”? Yeah…you already know how horrible this is.

Idea 3: “Each server can redirect to less busy servers”

Good boy, now we’re getting somewhere. But now you need servers to know about each other’s load in real time, and users would bounce through multiple redirects before reaching a page. Also, what happens if the first server they hit is the one that’s down?

And there are many more issues with this approach:

  1. How do clients even find out about multiple servers?
  2. How do you ensure traffic is spread evenly?
  3. What happens when a server goes down?
  4. If a user is logged in on Server 1, what happens when their next request goes to Server 2?
  5. How do you ensure all servers are running the same code and have access to the same data?

Obviously this isn’t the approach we want, but we’re getting there. We’re thinking differently. Now we have a pretty decent idea that we can’t just throw multiple servers at the problem and call it a day. So…what do we even do then?


The Load Balancer

Imagine if we had something that:

  • from the client’s POV:
    • would take all client requests to www.forwingmen.com
    • would handle the redirection to different servers automatically, if and when needed
  • from the server’s POV:
    • would receive requests from a single middleware
    • would send responses back to that single middleware

This is where load balancers come in. A load balancer is exactly the thing that replaces those question marks in the diagram we saw earlier.

The load balancer’s job is simple to describe but complex to implement well:

  1. Accept incoming client connections
  2. Choose which backend server should handle each request
  3. Forward the request to that server
  4. Receive the response from the server
  5. Send the response back to the client
  6. Monitor server health and stop sending traffic to dead servers
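To make that concrete, here’s a minimal sketch of steps 1-5 in Python, standard library only. It’s a naive round-robin with no health checks (step 6 comes later in the series), and the backend addresses are the hypothetical ones from above:

import itertools
import http.server
import urllib.request

# hypothetical backend pool -- the three servers from earlier
BACKENDS = itertools.cycle([
    "http://192.168.1.10:8000",
    "http://192.168.1.11:8000",
    "http://192.168.1.12:8000",
])

class LoadBalancer(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(BACKENDS)  # 2. choose which server handles this request
        try:
            # 3. forward the request, 4. receive the response
            with urllib.request.urlopen(backend + self.path) as resp:
                body = resp.read()
            # 5. send the response back to the client
            self.send_response(resp.status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            # a dead backend still gets picked -- this is exactly why we need step 6
            self.send_error(502, "Bad Gateway")

if __name__ == "__main__":
    # 1. accept incoming client connections on one well-known address
    http.server.ThreadingHTTPServer(("", 8080), LoadBalancer).serve_forever()

Only GETs, no streaming, no timeouts, no health checks. But it already does the core job: one address for clients, many servers behind it.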

This would literally solve all of our problems here:

  • Clients only need to know one address (the load balancer)
  • Traffic gets distributed across multiple servers
  • Failed servers get automatically excluded
  • We can handle session persistence (we’ll cover this later)
  • We can add or remove servers without clients knowing

What Now?

This seems like a pretty decent introduction to the problem we’ll solve, and how. So I’ll end this part here. In the next part, we’re going to build the simplest possible load balancer: DNS based round robin. We’ll implement it, test it, and then break it to understand why we need something better.

This series will easily be 20+ parts long, because I’m here for the long game. Going to teach as many things as possible.

See you in the next part, soldier :)