Your BFT Protocol Will Break in Production


Several teams, including Espresso, Monad, Plasma, Aptos, and Sui, are building variations of BFT consensus protocols, so we decided to write up some of the security problems that arose during our time working on Espresso’s HotShot, which began as an implementation of HotStuff and now implements HotStuff-2.




Quick Explanation of BFT

Byzantine Fault Tolerant (BFT) consensus is a class of algorithms for reaching agreement in a system where any node may behave adversarially. The setting derives from the Byzantine Generals Problem, which models how honest parties can agree despite potentially malicious participants in a distributed network.

These protocols come with security proofs that are rigorously evaluated through peer review. Contemporary BFT protocols deployed in Web3 environments nonetheless rely on several security assumptions which, if violated, can compromise the network. A non-exhaustive list is as follows:


Quick Explanation of HotStuff

HotStuff is a modern BFT consensus protocol that streamlines agreement across network nodes using a pipelined three-phase commit: prepare, pre-commit, and commit. Each view has a single leader responsible for consensus in that round.
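To make "a single leader per view" concrete, here is a minimal sketch of leader rotation and quorum sizing. It assumes plain round-robin rotation and equal voting weight, which is a simplification for illustration; real deployments (stake-weighted or randomized selection) will differ. The names `max_faulty`, `quorum_size`, and `leader_for_view` are hypothetical.

```python
def max_faulty(n: int) -> int:
    """Largest number of Byzantine nodes f tolerated by n nodes (n >= 3f + 1)."""
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Votes needed to certify a proposal: n - f (equals 2f + 1 when n = 3f + 1)."""
    return n - max_faulty(n)


def leader_for_view(view: int, validators: list[str]) -> str:
    """Assumption: round-robin rotation over a fixed validator list."""
    return validators[view % len(validators)]
```

With four validators, one fault is tolerated and three votes form a quorum; the leader simply cycles through the list as the view number increases.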

Leader-heavy BFT protocols like HotStuff rely on the leader of a given step to drive the protocol forward. HotStuff's pipelined approach and linear communication complexity make it scalable. HotStuff-2 is an evolution of this protocol, hinging on the so-called “chain rule.”


Terminology


Classes of Failures

Rather than rare bugs like unsigned proposals, we focus on halts — situations where the network can no longer progress.

A halt means the view or round stops increasing. While timeout mechanisms should rotate the leader and avoid this, halts still occur due to:

We call these mismatches rate-limiting steps — bottlenecks outside the protocol that determine its effective speed.
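Since a halt is defined above as the view no longer increasing, it can be monitored directly. The following is a rough sketch of such a watchdog, not part of any protocol; the class name `HaltWatchdog`, its fields, and the threshold are all hypothetical. Time is passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass


@dataclass
class HaltWatchdog:
    stall_threshold_s: float   # how long without progress counts as a halt
    last_progress_at: float    # timestamp of the last observed view increase
    last_view: int = -1

    def observe(self, view: int, now: float) -> None:
        """Record an observed view; only a strictly higher view counts as progress."""
        if view > self.last_view:
            self.last_view = view
            self.last_progress_at = now

    def is_halted(self, now: float) -> bool:
        """True if the view has not increased within the threshold."""
        return now - self.last_progress_at > self.stall_threshold_s
```

In practice `now` would come from a monotonic clock, and the threshold would need to sit well above the protocol's own view timeout so that ordinary leader rotation is not reported as a halt.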


Security Flaws

Halts and DoS

Condition #1: Signature Replay

Imagine a signed proposal being replayed. Even if the proposal includes a view number, improper validation can allow the replayed message to be accepted.

In pipelined HotStuff, views progress concurrently. A malicious node might send a replayed proposal with a valid aggregate signature for a future view, disrupting the pipeline.
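A minimal sketch of the kind of check that blocks this replay, under loud assumptions: HMAC stands in for the real (aggregate) signature scheme, the view number is bound into the signed bytes, and the node rejects anything not addressed to its current view rather than buffering future views. `Proposal`, `ReplayGuard`, and the domain tag are hypothetical names, not HotShot's API.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    view: int
    payload: bytes
    signature: bytes


def signing_bytes(view: int, payload: bytes) -> bytes:
    # Domain tag plus fixed-width view, so a signature is only valid for one view.
    return b"proposal-v1|" + view.to_bytes(8, "big") + hashlib.sha256(payload).digest()


def sign(key: bytes, view: int, payload: bytes) -> bytes:
    # Assumption: HMAC is a stand-in for the real signature scheme.
    return hmac.new(key, signing_bytes(view, payload), hashlib.sha256).digest()


class ReplayGuard:
    def __init__(self, leader_key: bytes) -> None:
        self.leader_key = leader_key
        self.current_view = 0
        self.seen_views: set[int] = set()

    def accept(self, p: Proposal) -> bool:
        if p.view != self.current_view:      # stale or future view
            return False
        if p.view in self.seen_views:        # already accepted a proposal for this view
            return False
        expected = sign(self.leader_key, p.view, p.payload)
        if not hmac.compare_digest(expected, p.signature):
            return False                     # signature not bound to this view and payload
        self.seen_views.add(p.view)
        return True
```

The key point is that the view check and the signature check cover the same view number: a proposal relabeled for a later view fails verification because its signature was computed over the original view.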

Failure Modes:

Mitigation Summary:

Each node should:


Condition #2: Deferred Validation

A halt can occur if a proposal passes voting before critical data is validated.

Example: A proposer sets the latest fetched L1 block to a massive number. If this isn’t validated, the next proposer cannot proceed until that number is reached — halting the network.

This is often a programming oversight, not a protocol failure.

Mitigation Summary:

All proposal data must be validated before quorum formation. Malicious nodes or custom implementations that skip checks should be caught at this stage.
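For the L1 block example above, a pre-vote check might look like the sketch below. It assumes each node tracks its own view of the L1 head and tolerates a small lead to account for nodes that lag slightly; `ProposalHeader`, `validate_l1_block`, and the `max_lead` default are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposalHeader:
    view: int
    l1_block: int  # latest L1 block the proposer claims to have fetched


def validate_l1_block(
    header: ProposalHeader,
    parent_l1_block: int,
    local_l1_head: int,
    max_lead: int = 2,
) -> bool:
    """Run before voting, so an implausible L1 block number never reaches a quorum."""
    if header.l1_block < parent_l1_block:
        return False  # the L1 reference must not move backwards
    if header.l1_block > local_l1_head + max_lead:
        return False  # far ahead of anything this node has observed on L1
    return True
```

A node that only votes when this returns `True` refuses the "massive number" proposal outright, instead of discovering the problem when the next proposer is unable to build on it.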


Implementation Errors

Race Condition in State Update

Three functions exist:

  1. One updates state.
  2. One handles view changes.
  3. One updates state based on built blocks.

If event order assumptions aren’t enforced, a proposal may be validated against stale state — allowing invalid updates to pass.
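One way to enforce the ordering assumption, sketched here with hypothetical names, is to version the shared state and refuse to apply a proposal that was validated against an older version. This assumes a single lock guarding all three handlers, which may not match how a real implementation structures its tasks.

```python
import threading


class ConsensusState:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.view = 0
        self.height = 0
        self.version = 0  # bumped on every mutation

    def on_view_change(self, new_view: int) -> None:
        with self._lock:
            if new_view > self.view:
                self.view = new_view
                self.version += 1

    def on_block_built(self, height: int) -> None:
        with self._lock:
            if height == self.height + 1:
                self.height = height
                self.version += 1

    def snapshot(self) -> tuple[int, int, int]:
        with self._lock:
            return (self.version, self.view, self.height)

    def apply_if_current(self, validated_version: int, view: int, height: int) -> bool:
        """Apply a validated proposal only if the state has not changed since validation."""
        with self._lock:
            if validated_version != self.version:
                return False  # state moved underneath us; caller must re-validate
            if view != self.view or height != self.height + 1:
                return False
            self.height = height
            self.version += 1
            return True
```

A caller takes a `snapshot()`, validates the proposal against it, then calls `apply_if_current` with the snapshot's version. If a view change or block update landed in between, the apply is rejected and validation is repeated against fresh state.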

Mitigation Summary:


Final Thoughts

These issues don’t typically appear at first. They arise with load, integration, and evolving systems. Our goal isn’t to teach BFT, but to warn about what can break in an implementation even when the protocol itself is sound.


Looking for the reports?
👉 https://github.com/0xkato/Portfolio/tree/main/Espresso

Shoutout to Jarred for being my co-author on this post.
