ZooKeeper Under Contention — Correctness Is Easy, Throughput Isn’t

ZooKeeper gives you something that feels deceptively strong: strict ordering, versioned writes, and a guarantee that updates won’t be silently lost — serialized writes. You read a value, you write with a version, and if someone else got there first, your write fails. Clean. Deterministic.

On paper, that looks like everything you want.

Then you put 1000 clients on a single znode.

I tried a very simple thing: a distributed counter on a 3-node ZooKeeper cluster. Nothing fancy. Just increment a value stored at path f.

The most natural implementation is optimistic concurrency. Read the value, increment it, and write it back with the version you read. If it fails, retry.

while(1) {
    val, ver = getdata("f")
    if setdata("f", val+1, ver) {
        break
    }
}

This is the kind of code you’d write without thinking too much. It’s lock-free, simple, and leverages ZooKeeper’s guarantees.

It also collapses under contention.

The behavior becomes clearer when you look at it side by side:

Scenario	Final Value	Total Retries
100 clients	100	2,663
1000 clients	1000	250,594

Even at moderate contention, retries dominate useful work. At higher contention, the system spends most of its time failing and retrying instead of progressing.

What’s happening here isn’t subtle. Multiple clients read the same version, all attempt the same write, and ZooKeeper allows exactly one to succeed. Everyone else fails and retries. Those retries overlap again, collide again, and fail again.

You don’t get gradual degradation. You get a feedback loop.

More clients → more collisions → more retries → even more overlap → even more collisions.

The system is technically correct the entire time. No updates are lost. But most of the work being done is wasted.

There’s another layer to this that isn’t obvious until you look at the numbers carefully. This was on a 3-node ensemble—the smallest possible quorum. Every write needs a majority, so at least 2 nodes have to acknowledge it.

As you increase the number of nodes, the quorum size increases, write latency increases, and the window in which other clients can collide gets wider. So the retry problem doesn’t stay constant—it amplifies.

This is one of those cases where adding more nodes can actually make a hot key worse.

I then tried adding exponential backoff. Same logic, just don’t retry immediately.

Strategy	Total Retries	Time Taken
No backoff	~250,000	> 2 min
Exponential backoff (10ms)	~27,000	~13 sec

Same logic, same correctness—just different retry behavior.

Nothing about correctness changed. Only the retry behavior changed.

What backoff does here is not optimization in the usual sense—it’s damage control. It reduces synchronization between failing clients. Instead of everyone retrying at the same time, retries get spread out, so fewer clients collide on each attempt.

You still have contention. You’re just making it less chaotic.

At this point, the pattern becomes clear: as long as everyone is trying to update the same value concurrently, you are either going to waste work resolving conflicts or introduce delays to avoid them.

So I tried the opposite approach. Instead of letting clients collide and resolve conflicts, we prevent collisions entirely.

ZooKeeper gives you ephemeral sequential nodes, which are perfect for building a lock queue.

path = create("/lock/lock-", EPHEMERAL|SEQUENTIAL)
 
while(1) {
    children = getChildren("/lock")
    sort(children)
 
    if myNode == children[0] {
        break
    }
 
    prev = node just before mine
    wait for deletion of prev (watch)
}
 
val, ver = getdata("f")
setdata("f", val+1, ver)
 
delete(path)

The flow looks simple, but there are three decisions here that make the whole thing work.

First, the node we create is EPHEMERAL | SEQUENTIAL.

Sequential means ZooKeeper appends a monotonically increasing number to the node name. So if 3 clients call create, you’ll get something like: lock-0001, lock-0002, lock-0003

That number is your position in line. No extra coordination is needed—ordering is encoded in the name itself.

Ephemeral means the node is tied to the client’s session. If the client crashes or disconnects, ZooKeeper automatically deletes the node. That’s critical because it ensures locks don’t get stuck behind dead clients.

So with one call, you’ve solved two problems: ordering and failure handling.

Second, who actually has the lock?

The rule is simple: the client with the smallest sequence number owns the lock.

Every client lists children under /lock, sorts them, and checks if it is the first element. If yes, it enters the critical section. If not, it has to wait.

Third, how waiting works is where the real trick is.

If you just keep polling or if everyone watches the entire /lock directory, you recreate the same contention problem. When the lock is released, all clients wake up, race again, and you’re back to chaos.

Instead, each client finds the node just before it:

C1 <- C2 <- C3 <- C4

If you are C3, you don’t care about C1. You only care about C2.

So you set a watch on C2 and go to sleep.

When C2 finishes, it deletes its node. ZooKeeper triggers exactly one watch—C3 wakes up, re-checks, and now becomes the smallest node, so it proceeds.

Then the same thing repeats for C4.

So instead of 1000 clients competing at once, you get a chain of handoffs where exactly one client becomes active at a time.

That’s the core idea: take N-way contention and turn it into a sequence of 1-by-1 transfers.

There are no retries due to conflicts, no thundering herd, and no wasted work. Just waiting and deterministic progress.

Now clients don’t compete at the same time. They line up.

Each client creates a sequential node, and ZooKeeper orders them. You only proceed if you’re the smallest node. Otherwise, you watch the node just before you and wait.

This is important: you don’t watch the whole set. You watch exactly one predecessor. That’s what avoids the thundering herd problem.

Under the same 1000 client load, the behavior changes completely:

Approach	Final Value	Total Retries	Time Taken
Optimistic (no backoff)	1000	~250,000	> 2 min
Optimistic (backoff)	1000	~27,000	~13 sec
Lock (sequential)	1000	~10,405	~11 sec

A quick note on the ~10k “retries” here: these are not write conflicts. They come from client-side timeouts / rechecks. A client should not wait indefinitely on a watch (sessions can expire, events can be missed), so it periodically wakes up, re-lists children, and re-evaluates its position. Those wakeups show up as retries in the numbers, even though there are no conflicting writes.

The system stops wasting work and starts making steady progress.

Nothing is "faster" in the sense of parallelism. In fact, it’s strictly slower per operation because everything is serialized. But the system is no longer wasting work. There are no failed writes due to version conflicts. Progress is steady.

This is the trade-off in its simplest form.

With optimistic concurrency, you get parallelism, but you pay for it with retries under contention. With locking, you remove retries, but you give up parallelism.

You’re choosing between wasted work and waiting.

ZooKeeper doesn’t hide this trade-off. It enforces it.

The interesting part is that both approaches are “correct,” and both can look reasonable in isolation. The difference only shows up when you put real load on a shared piece of state.

And that’s the actual problem here.

The bottleneck isn’t ZooKeeper. It’s the fact that thousands of clients are trying to mutate a single value.

As long as that remains true, you don’t have many options. You either coordinate aggressively (locks), or you let conflicts happen and deal with the fallout (retries and backoff).

There isn’t a third strategy hiding somewhere in the API.

At some point, the question stops being about which ZooKeeper primitive to use, and starts being about why the system is designed around a single hot key in the first place.

Because once contention gets high enough, the system isn’t limited by correctness anymore.

It’s limited by how much inefficiency you’re willing to tolerate to maintain it.