Your BFT Protocol Will Break in Production
Since there are a few teams building variations of BFT consensus protocols, like Espresso, Monad, Plasma, Aptos and Sui, we thought we would write a blog post on some of the security problems that have arisen during our time working on Espresso’s Hotshot, which was originally an implementation of HotStuff and is now an implementation of HotStuff-2.
Table of Contents
- Quick Explanation of BFT
- Quick Explanation of HotStuff
- Terminology
- Classes of Failures
- Security Flaws
- Final Thoughts
Quick Explanation of BFT
Byzantine Fault Tolerant (BFT) consensus is a class of algorithms for reaching agreement in a system where any node may behave adversarially. The problem derives from the Byzantine Generals Problem, which models how honest parties can agree despite potentially malicious parties within a distributed network.
These protocols come with formal security proofs, which are rigorously evaluated through peer review. Contemporary BFT protocols deployed within Web3 environments rely on several security assumptions which, if violated, can compromise the network. A non-exhaustive list is as follows:
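- Honest Supermajority: So long as more than two-thirds of nodes are “honest” (that is, fewer than one-third are byzantine), the protocol is safe from tampering by byzantine actors.
- Cryptographic Security of Messages: Each message is signed with its sender’s private key, so recipients can verify who sent it and that it was not altered.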
- Equivocation Resistance: The protocol should detect and ideally penalize nodes that equivocate (i.e., send conflicting messages to different parties).
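To make the equivocation point concrete, here is a minimal sketch of a detector. The `SignedMessage` shape and the `EquivocationDetector` name are hypothetical, chosen for illustration, and signature verification is assumed to have happened already.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SignedMessage:
    # Hypothetical message shape; a real message would also carry a signature.
    signer: str
    view: int
    payload_hash: str


@dataclass
class EquivocationDetector:
    # Maps (signer, view) to the first message seen from that signer in that view.
    seen: dict = field(default_factory=dict)

    def observe(self, msg: SignedMessage):
        """Return the conflicting pair if msg equivocates, otherwise None."""
        key = (msg.signer, msg.view)
        first = self.seen.setdefault(key, msg)
        if first.payload_hash != msg.payload_hash:
            # Two different payloads from one signer in one view: keep both as evidence.
            return (first, msg)
        return None
```

The returned pair is the kind of evidence a penalty mechanism could consume. Whether and how to penalize is a protocol decision this sketch does not cover.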
Quick Explanation of HotStuff
HotStuff is a modern BFT consensus protocol that streamlines agreement across network nodes using a pipelined three-phase commit: prepare, pre-commit, and commit. Each view has a single leader responsible for consensus in that round.
Leader-heavy BFT, like HotStuff, relies on the current leader to drive protocol progress at each step. Its pipelined approach and linear communication complexity make it scalable. HotStuff-2 is an evolution of this protocol, hinging on the so-called “chain rule.”
Terminology
- View: A round of consensus where one leader proposes and other nodes vote.
- Leader / proposer: The node responsible for making the proposal.
- Voter: Every node except the current leader.
- Quorum: Minimum number of participants required to reach agreement.
- Proposal: The value a leader suggests for consensus.
- Certificate: Proof that a quorum has been reached for a given proposal.
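As a rough illustration of how quorum and certificate relate, the sketch below assumes the common setting of n equally weighted nodes tolerating f byzantine nodes with n >= 3f + 1. The function names are ours, and stake-weighted systems would count weight rather than heads.

```python
def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    if n < 1:
        raise ValueError("need at least one node")
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Smallest vote count for which any two quorums share at least one honest node."""
    f = max_faulty(n)
    # Two quorums of size q overlap in at least 2q - n nodes; we need that to exceed f.
    return (n + f) // 2 + 1


def has_quorum(voters: set, n: int) -> bool:
    """A certificate is only meaningful if it carries votes from distinct signers."""
    return len(voters) >= quorum_size(n)
```

For n = 4 this gives the familiar 3-of-4 threshold, and for n = 3f + 1 in general it reduces to 2f + 1. Passing voters as a set means a duplicated vote cannot inflate the count.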
Classes of Failures
Rather than rare bugs like unsigned proposals, we focus on halts — situations where the network can no longer progress.
A halt means the view or round stops increasing. While timeout mechanisms should rotate the leader and avoid this, halts still occur due to:
- First-order issues: Logic bugs, panics, race conditions.
- Second-order issues: External service outages, third-party library failures.
- Integration errors: Mismatched assumptions between the BFT protocol and dependent systems (e.g., fixed interval data submissions).
We call these mismatches rate-limiting steps — bottlenecks outside the protocol that determine its effective speed.
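Since a halt is defined above as the view no longer increasing, it can be monitored directly. The following is a minimal sketch of such a watchdog; `HaltWatchdog` and `max_stall` are hypothetical names, and the right stall threshold depends on your timeout configuration.

```python
import time


class HaltWatchdog:
    """Flags a halt when the view number has not increased for max_stall seconds."""

    def __init__(self, max_stall: float, clock=time.monotonic):
        self.max_stall = max_stall
        self.clock = clock  # injectable so tests can control time
        self.last_view = -1
        self.last_progress = clock()

    def record_view(self, view: int) -> None:
        # Only a strictly higher view counts as progress.
        if view > self.last_view:
            self.last_view = view
            self.last_progress = self.clock()

    def is_halted(self) -> bool:
        return self.clock() - self.last_progress > self.max_stall
```

A monitor like this does not say which class of failure caused the stall, but it turns "the network feels slow" into an alert you can act on.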
Security Flaws
Halts and DoS
Condition #1: Signature Replay
Imagine a signed proposal being replayed. Even if the proposal includes a view number, a node that validates it improperly can still accept the replay.
In pipelined HotStuff, several views are in flight at once. A malicious node might replay a proposal that carries a valid aggregate signature, presenting it for a future view and disrupting the pipeline.
Failure Modes:
- Passive voter: Votes blindly or locks onto the first proposal.
- Incompetent leader: Accepts old signed proposals without proper checks.
Mitigation Summary:
Each node should:
- Derive the expected proposal independently.
- Validate proposal structure before signature check.
- Enforce strong temporal constraints (not just view number) in signatures.
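The mitigations above can be sketched roughly as follows. This is an illustration under assumptions, not Hotshot's code: HMAC-SHA256 stands in for a real signature scheme so the example runs with the standard library only, and `chain_id`, `expected_parent`, and the `Proposal` fields are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    view: int
    parent_hash: bytes
    payload: bytes
    signature: bytes


def signing_digest(chain_id: bytes, view: int, parent_hash: bytes, payload: bytes) -> bytes:
    """Bind the signature to its full context, not just the payload."""
    h = hashlib.sha256()
    for part in (chain_id, view.to_bytes(8, "big"), parent_hash, payload):
        h.update(len(part).to_bytes(4, "big"))  # length prefix avoids ambiguous concatenation
        h.update(part)
    return h.digest()


def sign(key: bytes, chain_id: bytes, view: int, parent_hash: bytes, payload: bytes) -> bytes:
    # Assumption: HMAC is a stand-in for a real signature, used only to keep this runnable.
    return hmac.new(key, signing_digest(chain_id, view, parent_hash, payload), hashlib.sha256).digest()


def validate(p: Proposal, *, key: bytes, chain_id: bytes, current_view: int, expected_parent: bytes) -> bool:
    # 1. Structural and temporal checks against locally derived expectations come first.
    if p.view != current_view or p.parent_hash != expected_parent:
        return False
    # 2. Only then verify the signature over the full context.
    expected = sign(key, chain_id, p.view, p.parent_hash, p.payload)
    return hmac.compare_digest(expected, p.signature)
```

Because the view, parent, and chain identifier are all part of the signed digest, copying an old signature onto a proposal for a different view fails verification, and presenting the old proposal unchanged fails the view check.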
Condition #2: Deferred Validation
A halt can occur if a proposal passes voting before critical data is validated.
Example: A proposer sets the latest fetched L1 block to a massive number. If voters do not validate this value, the next proposer cannot proceed until the L1 chain actually reaches that number, which halts the network.
This is often a programming oversight, not a protocol failure.
Mitigation Summary:
All proposal data must be validated before quorum formation. Malicious nodes or custom implementations that skip checks should be caught at this stage.
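For the L1 block example, a pre-vote check might look like the sketch below. The field and parameter names (`l1_block`, `parent_l1_block`, `local_l1_head`, `max_lead`) are hypothetical, and we assume each voter tracks its own view of the L1 head.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    view: int
    l1_block: int  # latest L1 block the proposer claims to have fetched


def should_vote(p: Proposal, parent_l1_block: int, local_l1_head: int, max_lead: int = 0) -> bool:
    """Validate the L1 reference before voting, so a bad value never reaches a quorum."""
    # Reject references that move backwards relative to the parent proposal.
    if p.l1_block < parent_l1_block:
        return False
    # Reject references this node cannot confirm against its own view of L1.
    # max_lead is an assumed tolerance for voters that lag slightly behind the proposer.
    if p.l1_block > local_l1_head + max_lead:
        return False
    return True
```

The important property is when the check runs: before the vote is cast, not after the proposal has already been certified.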
Implementation Errors
Race Condition in State Update
Consider three functions that touch the same consensus state:
- One updates state.
- One handles view changes.
- One updates state based on built blocks.
If event order assumptions aren’t enforced, a proposal may be validated against stale state — allowing invalid updates to pass.
Mitigation Summary:
- Document read/update flows.
- Add byzantine-focused tests.
- Use locks or guards to enforce order where needed.
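One way to enforce that ordering is to put the check and the update behind a single lock. The sketch below is a simplified illustration: `ConsensusState` and its three methods are hypothetical stand-ins for the three functions described above, not the actual implementation.

```python
import threading


class ConsensusState:
    def __init__(self):
        self._lock = threading.Lock()
        self.view = 0
        self.height = 0

    def on_view_change(self, new_view: int) -> None:
        with self._lock:
            if new_view > self.view:  # never move the view backwards
                self.view = new_view

    def on_block_built(self, height: int) -> None:
        with self._lock:
            if height > self.height:
                self.height = height

    def validate_and_apply(self, proposal_view: int, proposal_height: int) -> bool:
        # Check and update under one lock, so validation cannot run against state
        # that another handler changes between the check and the write.
        with self._lock:
            if proposal_view != self.view or proposal_height != self.height + 1:
                return False
            self.height = proposal_height
            return True
```

The design choice here is that validation and mutation are one atomic step. Splitting them into a separate "check" call and "apply" call reintroduces the stale-state window, even if each call takes the lock on its own.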
Final Thoughts
These issues don’t typically appear at first. They arise with load, integration, and evolving systems. Our goal isn’t to teach BFT, but to warn about what can break in production even when the protocol itself is sound.
Looking for the reports?
👉 https://github.com/0xkato/Portfolio/tree/main/Espresso
Shoutout to Jarred for being my co-author on this post.