Dhruv Gupta

Engineer by craft, explorer by instinct

The Hardest Part of Raft Is Not Elections — It’s Log Replication

Why most Raft implementations don’t fail during leader election, but break down while replicating logs—when nodes diverge, leaders crash mid-flight, and consistency has to be actively repaired under real-world conditions where nothing stays perfectly in sync.

January 9, 2026
Distributed Systems · Raft · Consensus



Raft is often sold as the “understandable” consensus algorithm. And honestly, the first time you read it, that claim feels true. Elections make sense. You randomize timeouts, nodes request votes, one leader emerges, and the system moves on. It feels clean and almost mechanical.

Then you try to build it.

That’s when you realize the hard part was never elections. It’s log replication.

On paper, replication looks boringly simple. A leader receives a command, appends it to its log, sends an AppendEntries RPC, waits for a majority, and marks the entry as committed. If you just read that flow, it feels like plumbing.
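
If you wrote that flow down naively, it would look something like the sketch below: blocking calls, no retries, no term changes mid-flight, no persistence. The Go type and field names here are illustrative, not borrowed from any real implementation.

package raft

// A deliberately naive sketch of the happy path: blocking RPCs, no retries,
// no term changes mid-flight, no persistence.

type LogEntry struct {
    Term    int
    Command []byte
}

type Leader struct {
    currentTerm int
    log         []LogEntry
    commitIndex int
    peers       []string // addresses of the other nodes
    send        func(peer string, prevIndex, prevTerm int, entries []LogEntry) bool
}

// Propose appends a command to the leader's own log, replicates it, and marks
// it committed once a majority of the cluster (leader included) has it.
func (l *Leader) Propose(cmd []byte) {
    prevIndex := len(l.log) - 1
    prevTerm := 0
    if prevIndex >= 0 {
        prevTerm = l.log[prevIndex].Term
    }
    entry := LogEntry{Term: l.currentTerm, Command: cmd}
    l.log = append(l.log, entry)

    acks := 1 // the leader counts itself
    for _, p := range l.peers {
        if l.send(p, prevIndex, prevTerm, []LogEntry{entry}) {
            acks++
        }
    }
    if acks > (len(l.peers)+1)/2 {
        l.commitIndex = len(l.log) - 1 // majority reached: committed
    }
}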

In practice, it’s anything but.

The first place things start to feel uncomfortable is prevLogIndex and prevLogTerm. These two fields look like small bookkeeping details, but they carry the consistency check that Raft’s log-matching guarantee rests on. Every follower uses them to decide whether the leader’s view of the log is consistent with its own. If they don’t match what the follower has at that index, it rejects the request. Now the leader has to backtrack and try again.
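
The follower’s side of that check is short, which is part of why it’s easy to underestimate. A sketch, continuing the same illustrative types (0-based indices here, with a PrevLogIndex of -1 meaning the log was empty):

type AppendEntriesArgs struct {
    Term         int
    PrevLogIndex int // index of the entry immediately before the new ones
    PrevLogTerm  int // term of that entry
    Entries      []LogEntry
    LeaderCommit int // leader's commitIndex
}

type Follower struct {
    currentTerm int
    log         []LogEntry
    commitIndex int
}

// consistencyCheck answers one question: do I have an entry at PrevLogIndex
// whose term is PrevLogTerm? If not, reject and force the leader to back up.
func (f *Follower) consistencyCheck(args AppendEntriesArgs) bool {
    if args.Term < f.currentTerm {
        return false // request from a stale leader
    }
    if args.PrevLogIndex >= len(f.log) {
        return false // our log is too short to even check
    }
    if args.PrevLogIndex >= 0 && f.log[args.PrevLogIndex].Term != args.PrevLogTerm {
        return false // same index, different history
    }
    return true // safe to reconcile everything after PrevLogIndex
}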

And that “try again” is where reality kicks in.

In a clean system, logs are mostly aligned and replication is just appending new entries. In a real system, logs are almost always a little messy. Leaders crash mid-replication. Followers end up with partial entries. New leaders take over and rewrite the uncommitted tails of other nodes’ logs. What you get is not a smooth append-only log, but a constantly shifting structure that needs to be corrected again and again.

Something like this is very normal:

Leader:   [1, 2, 5, 6]
Follower: [1, 2, 3, 4]

A state like this usually comes from a very real sequence of events, not an edge case.

Imagine this:

  • Leader L1 is active and appends entries 3 and 4
  • It replicates 3 to some followers, but not all
  • Before 4 is fully replicated, L1 crashes
  • A new leader L2 is elected from a node that never saw 3 and 4
  • L2 appends new entries 5 and 6

Now different nodes have different "truths" about history.

So when the old leader (or a node that followed it) comes back, you end up with exactly this divergence. Both sides have valid logs from their perspective, but only one can survive.

Now the leader isn’t just sending new entries. It has to:

  • find the last matching index
  • force the follower to delete conflicting entries (3, 4)
  • replay the correct entries (5, 6)

That delete-and-rewrite step is where a lot of subtle bugs hide. If you truncate too aggressively, you throw away entries that were never actually in conflict. If you don’t truncate when you should, the logs never converge.
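
One way to write that repair step carefully, continuing the same sketch. The important detail is that the follower truncates only when it finds an actual conflict at an index, not on every request:

// The repair step. Truncate only at a real conflict (same index, different
// term). Truncating unconditionally lets a delayed or duplicated request
// delete entries it has no business touching.
func (f *Follower) reconcile(args AppendEntriesArgs) {
    insertAt := args.PrevLogIndex + 1
    for i, e := range args.Entries {
        idx := insertAt + i
        if idx < len(f.log) && f.log[idx].Term != e.Term {
            f.log = f.log[:idx] // conflict: drop this entry and everything after it
        }
        if idx >= len(f.log) {
            f.log = append(f.log, args.Entries[i:]...) // replay the leader's entries
            break
        }
        // otherwise we already have this exact entry; leave it alone
    }
    if args.LeaderCommit > f.commitIndex {
        f.commitIndex = min(args.LeaderCommit, len(f.log)-1)
    }
}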

That’s not replication anymore. That’s repair.

Most implementations break here because they assume divergence is rare. It isn’t. It’s the default state of a system that has seen even a little bit of failure.

Another subtle place people get tripped up is commit semantics. There’s a common intuition that once a majority has an entry, it’s “everywhere.” But that’s not how Raft works. The leader may consider an entry committed, while some followers are still behind. So the system is logically consistent but physically out of sync.

That gap matters more than it seems. In a real system, your application layer is often reading from the state machine while replication is still catching up.

If you apply entries before they are safely committed, you risk serving state that a leadership change can still roll back. If you apply too late, your system starts feeling slow even though the leader has already done the work.
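
Continuing the sketch, here is roughly what both halves of that boundary look like. The leader advances commitIndex as soon as a majority reports an entry; each node then applies committed entries at its own pace, which is where commitIndex and lastApplied drift apart:

// commitIndex moves forward as soon as a majority reports an entry.
// matchIndex[i] is the highest index known to be on peer i.
func (l *Leader) maybeAdvanceCommit(matchIndex []int) {
    for n := len(l.log) - 1; n > l.commitIndex; n-- {
        count := 1 // the leader always has its own entry
        for _, m := range matchIndex {
            if m >= n {
                count++
            }
        }
        // Raft only lets a leader commit entries from its own term by counting
        // replicas; older entries become committed indirectly.
        if count > (len(matchIndex)+1)/2 && l.log[n].Term == l.currentTerm {
            l.commitIndex = n
            break
        }
    }
}

// applyCommitted is what the application layer actually sees, on every node.
// With 0-based indices, lastApplied starts at -1.
func applyCommitted(log []LogEntry, commitIndex int, lastApplied *int, apply func(LogEntry)) {
    for *lastApplied < commitIndex {
        *lastApplied++
        apply(log[*lastApplied])
    }
}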

This becomes especially tricky when you have read-after-write expectations. A client writes something, gets success, and immediately reads—but the follower it hits might not have applied that entry yet.

Getting this boundary right is less about theory and more about understanding how your system is actually being used.

Retries sound simple too, until you actually need them. The naive approach is to decrement nextIndex one step at a time until the follower accepts. That works, but it’s painfully slow when logs have diverged a lot. You end up sending a large number of failed requests just to find the matching point.

Better implementations try to jump back faster, using term information or something closer to a binary search.

For example, instead of decrementing one index at a time, the leader can look at the term of the conflicting entry and skip entire ranges of the log. This drastically reduces the number of round trips when logs are far apart.
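
One common shape for that optimization, continuing the sketch: the follower reports the term of the conflicting entry and the first index it holds for that term, and the leader jumps back over the whole run in one step. The reply fields here are illustrative:

type AppendEntriesReply struct {
    Success       bool
    ConflictTerm  int // term of the entry that failed the check, or -1 if the log was too short
    ConflictIndex int // first index the follower holds for ConflictTerm, or its log length
}

// nextIndexAfterReject picks where the leader should retry from.
func (l *Leader) nextIndexAfterReject(reply AppendEntriesReply) int {
    if reply.ConflictTerm == -1 {
        return reply.ConflictIndex // follower's log is short: jump straight to its end
    }
    // If the leader also has entries from ConflictTerm, resume just past its
    // last one; otherwise skip the follower's entire run of that term.
    for i := len(l.log) - 1; i >= 0; i-- {
        if l.log[i].Term == reply.ConflictTerm {
            return i + 1
        }
    }
    return reply.ConflictIndex
}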

But this adds complexity—now you’re not just tracking indices, you’re reasoning about terms and how they map across nodes. This is one of those areas where a "working" implementation can still be very inefficient under real conditions.

Then there’s the fact that distributed systems don’t give you nice, ordered responses. You might send two replication requests and get the response for the later one first. If your state updates assume ordering, you’ll corrupt your own understanding of the follower’s progress. Suddenly your commit logic starts behaving inconsistently, and debugging that is not fun.
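
The usual defense, sketched below, is to make progress tracking monotonic and to tie every reply back to the term in which its request was sent:

// A reply only counts if it was sent in the current term, and it only ever
// moves matchIndex forward.
func (l *Leader) onAppendReply(peer, sentTerm, sentUpTo int, success bool, matchIndex, nextIndex []int) {
    if sentTerm != l.currentTerm {
        return // reply to a request from an earlier term: ignore it
    }
    if !success {
        return // rejections go through the backoff path, not here
    }
    if sentUpTo > matchIndex[peer] {
        matchIndex[peer] = sentUpTo // never move backwards
        nextIndex[peer] = sentUpTo + 1
    }
    // If sentUpTo <= matchIndex[peer], a newer reply already overtook this one;
    // applying it would make the follower look less caught up than it is.
}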

Followers also aren’t as passive as they seem. They’re constantly validating what the leader sends, checking terms, rejecting inconsistencies, and truncating their own logs when required. A weak follower implementation can quietly break the guarantees of the entire system, even if the leader looks correct.

And all of this gets worse once persistence enters the picture. Raft relies on writing logs and metadata to disk, but disks don’t give you atomic perfection for free. You can crash after a partial write. You can restart with a half-updated state. If your persistence layer isn’t careful, you can violate invariants without even realizing it.
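
A rough version of the ordering rule, continuing the sketch, with a hypothetical Storage interface standing in for whatever the real persistence layer looks like:

// Anything a node has acknowledged, or anything that affects how it will
// vote, must be durable before it replies to anyone.
type Storage interface {
    SaveTermAndVote(currentTerm, votedFor int) error       // fsynced before returning
    SaveLogSuffix(fromIndex int, entries []LogEntry) error // durably replaces the log from fromIndex onward
}

func (f *Follower) acceptEntries(st Storage, args AppendEntriesArgs) (bool, error) {
    if !f.consistencyCheck(args) {
        return false, nil
    }
    f.reconcile(args) // fix the in-memory log first
    // Persist before replying "yes": the leader may count this follower toward
    // a majority, and a crash must not be able to take that acknowledgement back.
    if err := st.SaveLogSuffix(args.PrevLogIndex+1, f.log[args.PrevLogIndex+1:]); err != nil {
        return false, err
    }
    return true, nil
}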

After going through all of this, the way you think about Raft changes. It stops feeling like a leader election algorithm with some replication attached. It starts looking like a log consistency protocol where elections are just a mechanism to decide who gets to drive the repair process.

Elections decide who leads. Log replication decides whether your system is actually correct.

Most bugs you’ll encounter won’t come from elections. They’ll come from subtle issues in how logs are compared, how conflicts are resolved, how retries are handled, or how state is applied. These are not obvious when you read the paper, but they show up very quickly when you build something real.

Raft is easy to explain. That part is true.

But once you’re dealing with real failures, real divergence, and real state, log replication stops being simple. And that’s exactly where the algorithm earns its complexity.