Building a resilient PoA blockchain with Substrate

The Substrate framework provides modular components for building highly customizable and extensible blockchains and blockchain-based solutions. Some of the most popular public networks using the Substrate framework are Polkadot, Kusama, and several of their parachains.

A blockchain built with Substrate is modular: a component at almost any layer of the stack can be replaced or enhanced as needed. You can have custom consensus, hashing and signing algorithms, transaction validations, and much more.

In this post, I walk through how I built a self-healing PoA (Proof of Authority) blockchain that can automatically recover its block period when nodes drop connections or fail to produce blocks for any reason.

In public PoS (Proof of Stake) systems, if a validator does not produce blocks at their assigned slots, they are generally slashed as a punishment. Also, in public networks the number of validators is relatively large. Even if a couple of them skip their slot, the average block period is not impacted too much. In permissioned PoA networks this is trickier, as we don’t have the concept of slashing (or staking, for that matter), and the number of validators is much smaller. Even if one of them skips its slot and goes offline, the block period grows noticeably, which results in latency and a bad user experience.

For example, let’s say we have a permissioned PoA network with 10 authorities and a 6-second block period. Let’s say one of these authorities goes down (offline) and does not produce blocks at its assigned slot(s). One slot in every ten is now empty, so the chain produces roughly 10% fewer blocks in the same time, and producing 10 blocks takes at least 11 slots, that is 66 seconds instead of 60. If another authority goes down, the time taken to produce 10 blocks reaches at least 72 seconds, and so on.
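The arithmetic above can be checked with a small simulation. This is only a sketch under simplifying assumptions: authorities take turns in strict round-robin order, an offline authority's slot simply stays empty, and the function name `seconds_to_produce` is made up for this post.

```rust
/// Seconds needed to produce `blocks` blocks when authorities take turns
/// round-robin and the authorities listed in `offline` skip their slots.
/// Slot `s` (counting from 0) is assigned to authority `s % authorities`.
/// Returns `None` when nobody is online, because the chain would stall.
pub fn seconds_to_produce(
    blocks: u32,
    authorities: u32,
    offline: &[u32],
    slot_secs: u32,
) -> Option<u32> {
    let online = (0..authorities).filter(|a| !offline.contains(a)).count();
    if online == 0 {
        return None;
    }

    let mut produced = 0;
    let mut slot = 0;
    while produced < blocks {
        // An online authority fills its slot; an offline one leaves it empty.
        if !offline.contains(&(slot % authorities)) {
            produced += 1;
        }
        slot += 1;
    }
    Some(slot * slot_secs)
}
```

With 10 authorities, 6-second slots, and the last authority in the rotation offline, `seconds_to_produce(10, 10, &[9], 6)` gives `Some(66)`; with the last two offline it gives `Some(72)`. These are best-case numbers: if the measurement starts on an empty slot, the wait is longer.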

One way to solve this problem is to kick these offline authorities out of the active set, so that the network no longer waits for them to produce blocks. Their slots would be redistributed among the other authorities, and the average block time would then recover. Let’s do this in code.

Substrate PoA Network

Now before we solve the block period recovery problem, let’s first understand how to create a PoA network in Substrate. For the sake of simplicity and to avoid confusion, I will use the term validator for both authority and validator in the rest of this blog post.

In Substrate, validator management is done using the Session pallet. The Session pallet defines the traits that, when implemented, allow other pallets to provide a new set of validators, and to subscribe to validator set changes. Basically, the Session pallet (sort of) sits on top of the block production and block finality pallets (consensus), and keeps track of the validator set. The Session pallet provides the validator set to the downstream consensus pallets at regular intervals called sessions. A session is basically a set of blocks, used as a unit of time. The number of blocks in a session is configurable for each blockchain.
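To make the idea of a session as a unit of time concrete, here is a minimal sketch that assumes fixed-length sessions counted from block 0. The helper names are hypothetical and this is not the Session pallet's code; in a real runtime the rotation rule is configurable.

```rust
/// Index of the session that `block` falls into, assuming fixed-length
/// sessions that start at block 0. `None` for a zero-length session.
pub fn session_index(block: u64, session_length: u64) -> Option<u64> {
    if session_length == 0 {
        None
    } else {
        Some(block / session_length)
    }
}

/// True when `block` is the first block of a new session, i.e. the point
/// at which a new validator set could take effect.
pub fn is_rotation_block(block: u64, session_length: u64) -> bool {
    session_length != 0 && block != 0 && block % session_length == 0
}
```

With 10-block sessions, block 25 belongs to session 2, and blocks 10, 20, 30, and so on are the rotation points.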

Now that we have seen how the Session pallet manages the validator set and provides it to the consensus pallets, let’s look at where it gets the validators from. In a PoS system, the list of validators comes from logic based on the stake. In Polkadot and Kusama, this is done in another pallet (module), which runs an election among all eligible validators (according to the stake), and then passes the list of elected validators to the Session pallet. The election is done at regular intervals, and the validator set keeps changing.

But in permissioned PoA networks, where the number of validators is small and addition and removal of validators is rather infrequent, we need a different approach. Also, in PoA networks there is no concept of stake and hence we simply cannot follow the same process that we have for PoS systems.

What if we could start a chain with a set of initial validators, and then add/remove more validators using a transaction on the chain? With a multi-signature transaction or a council based governance process, the existing validators could decide who to add in the validator set and who to remove. Wouldn’t that make validator management simpler yet dynamic for PoA networks? I guess, it would. And that’s why I built a pallet to do the same.

Validator Set Pallet

The Validator Set pallet allows addition and removal of validators using transactions in a Substrate-based PoA network. The implementation is simple: we keep a list (more like a local cache) of the validators in the storage of the pallet, and the transaction logic adds/removes validators from this list. This list is then passed to the Session pallet by implementing the session management traits that I mentioned before. Simple, right?
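The shape of this design can be sketched in plain Rust, without the FRAME machinery. Everything here is a simplified stand-in: the local `SessionManager` trait only mirrors the idea of the session hook, the `ValidatorSet` struct plays the role of the pallet's storage, and origin checks (multi-sig or council) are left out.

```rust
/// Simplified stand-in for the session-management hook: the session logic
/// asks for the validators of an upcoming session.
pub trait SessionManager<ValidatorId> {
    /// `Some(set)` to install a validator set, `None` to keep the current one.
    fn new_session(&mut self, new_index: u32) -> Option<Vec<ValidatorId>>;
}

/// Hypothetical in-memory model of the list kept in the pallet's storage.
pub struct ValidatorSet {
    validators: Vec<String>,
}

impl ValidatorSet {
    /// Start the chain with a set of initial validators.
    pub fn new(initial: &[&str]) -> Self {
        Self { validators: initial.iter().map(|v| v.to_string()).collect() }
    }

    /// Transaction logic: add a validator unless it is already in the list.
    pub fn add_validator(&mut self, who: &str) -> Result<(), &'static str> {
        if self.validators.iter().any(|v| v == who) {
            return Err("already a validator");
        }
        self.validators.push(who.to_string());
        Ok(())
    }

    /// Transaction logic: remove a validator that is currently in the list.
    pub fn remove_validator(&mut self, who: &str) -> Result<(), &'static str> {
        let before = self.validators.len();
        self.validators.retain(|v| v != who);
        if self.validators.len() == before {
            return Err("not a validator");
        }
        Ok(())
    }
}

impl SessionManager<String> for ValidatorSet {
    /// Hand the current list to the session logic at every rotation.
    fn new_session(&mut self, _new_index: u32) -> Option<Vec<String>> {
        Some(self.validators.clone())
    }
}
```

The point of the sketch is the split of responsibilities: transactions only edit the list, and the change reaches block production when the session logic next asks for the set.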

Removal of Offline Validators

Now that we are through the validator management part, let’s look at the resiliency part (block period recovery) when one or more validators go offline.

The first thing we need is a way to find out if and when a validator went offline. We need some sort of event that we can subscribe to, triggered when a validator stops producing blocks or goes down.

That is exactly what we get from the Substrate ImOnline pallet. When included in a runtime, the ImOnline pallet sends heartbeats from every validator. The heartbeats are sent as transactions, once each session. The pallet also keeps track of these heartbeats, and triggers an event when one or more validators have not sent a heartbeat. This way we can identify if and when one or more validators went offline.
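The bookkeeping behind this can be modelled in a few lines. This is a simplified sketch of the idea, not the ImOnline pallet's implementation; the `HeartbeatTracker` type and its methods are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of per-session heartbeat bookkeeping.
pub struct HeartbeatTracker {
    validators: Vec<String>,
    received: BTreeSet<String>,
}

impl HeartbeatTracker {
    pub fn new(validators: Vec<String>) -> Self {
        Self { validators, received: BTreeSet::new() }
    }

    /// Record a heartbeat. Returns `false` (and records nothing) when the
    /// sender is not in the active validator set.
    pub fn heartbeat(&mut self, who: &str) -> bool {
        if self.validators.iter().any(|v| v == who) {
            self.received.insert(who.to_string());
            true
        } else {
            false
        }
    }

    /// Called when a session ends: returns the validators that sent no
    /// heartbeat during the session and resets the record for the next one.
    pub fn end_session(&mut self) -> Vec<String> {
        let offline = self
            .validators
            .iter()
            .filter(|v| !self.received.contains(*v))
            .cloned()
            .collect();
        self.received.clear();
        offline
    }
}
```

The list returned by `end_session` corresponds to the set of offline validators that the rest of this post reacts to.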

The Validator Set pallet implements the ReportOffence trait, which the ImOnline pallet uses to report unresponsive validators. At the end of a session, if one or more validators have not sent a heartbeat, the ImOnline pallet calls the report_offence function of this trait. The function parameters carry information about all validators that were reported offline during the last session. That’s it, this is what we were looking for all along.

Once we know which validators have been reported offline, we can remove them from the validator set just like we would remove any other validator using the Validator Set pallet. All this happens automatically by implementing the said trait, and calling the remove function from the implementation. The following code snippet briefly shows this logic.

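// Called by the ImOnline pallet when validators miss their heartbeats.
fn report_offence(_reporters: Vec<T::AccountId>, offence: O) -> Result<(), OffenceError> {
    // Validators reported offline during the last session.
    let offenders = offence.offenders();

    // Mark each offender for removal from the validator set.
    for (v, _) in offenders.into_iter() {
        Self::mark_for_removal(v);
    }

    Ok(())
}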

As soon as the new validator set (after removal of offline validators) comes into effect, once the Session pallet rotates the session, the block period of the network goes back to normal. At this time the network would have adjusted the block production slots as per the new number of active validators, and the configured target block period.

So, it seems we have solved the problem and made the PoA network resilient. Not quite. We have only solved half the problem so far. What happens when validators go offline one after another and most of them are removed from the network? That would reduce decentralization quite a bit. Let’s solve this too.

What happens when a validator comes back online?

Ok, so we have removed the offline validators. Now they won’t stay offline forever, right? Nobody would want that. The responsible devops team(s) would, at some point, bring their validator(s) back online (hopefully).

Also, remember, these validators were added to the network after agreement among all existing validators (through multi-sig or council transactions). They still hold the right to produce blocks until the network kicks them out explicitly. So, when these validators come back online, they should be added back to the network.

Hold on, it is not that simple, yet. Once you remove a validator and the updated validator set becomes active, the block producing slots are redistributed among active online validators. So, when a validator, after its removal, comes back online, it needs to be added again to the network through a transaction, and wait for the updated validator set to become active after the session rotation. That’s a whole ceremony to be repeated, for all the good network security reasons.

To make this a bit simpler, the Validator Set pallet keeps track of all the validators added to the network that have not been explicitly removed (yet). If a validator is removed because it was offline, and not through a removal transaction, it is still considered approved. It can be added back, without council or multi-sig, when it comes online again. For this, there is a separate transaction type which can only be called by this validator’s AccountId, to add itself back to the network.
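The distinction between approved and active validators can be sketched as two sets. This is a plain-Rust model under assumptions: the struct, method, and error names are hypothetical, accounts are plain strings, and the caller check stands in for the signed-origin check a real transaction would perform.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
pub enum Error {
    NotSelf,
    NotApproved,
    AlreadyActive,
}

/// Hypothetical model: `approved` holds every validator admitted through
/// governance and not explicitly removed; `active` holds the current set.
#[derive(Default)]
pub struct Validators {
    approved: BTreeSet<String>,
    active: BTreeSet<String>,
}

impl Validators {
    /// Governance path (multi-sig or council): approve and activate.
    pub fn add_validator(&mut self, who: &str) {
        self.approved.insert(who.to_string());
        self.active.insert(who.to_string());
    }

    /// Governance path: explicit removal also revokes the approval.
    pub fn remove_validator(&mut self, who: &str) {
        self.approved.remove(who);
        self.active.remove(who);
    }

    /// Automatic path: an offline validator leaves the active set only.
    pub fn remove_offline(&mut self, who: &str) {
        self.active.remove(who);
    }

    /// Self-service path: a still-approved validator re-adds itself.
    pub fn add_validator_again(&mut self, caller: &str, who: &str) -> Result<(), Error> {
        if caller != who {
            return Err(Error::NotSelf);
        }
        if !self.approved.contains(who) {
            return Err(Error::NotApproved);
        }
        if !self.active.insert(who.to_string()) {
            return Err(Error::AlreadyActive);
        }
        Ok(())
    }

    pub fn is_active(&self, who: &str) -> bool {
        self.active.contains(who)
    }
}
```

Keeping the approval separate from the active set is what lets an offline removal be undone cheaply, while an explicit removal still requires going through governance again.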

But we have another problem.

Once these validators have been automatically removed from the validator set, they cannot produce blocks anymore, even if they come back online, which is obvious. They can only sync with the network as a non-validator full node.

The heartbeat logic of the ImOnline pallet only works for active, online validators. It does not work for any other full node. Hence, we cannot find out if a validator came back online through the ImOnline logic. So, in this case we should rather depend on an off-chain devops process.

What we can do is run an off-chain service (listener) that keeps pinging validators after they are removed from the validator set. If and when they come back online, this service would know, and it could then add them back to the validator set by sending a transaction to the Validator Set pallet. Of course, this last part of adding a validator again is a bit of a hack, but in a permissioned PoA network we should be able to live with it, I think.
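One polling pass of such a listener might look like the sketch below. It is deliberately abstract: `poll_once` and `tcp_probe` are hypothetical names, the liveness check and the transaction submission are passed in as closures, and treating a successful TCP connection as "online" is an assumption that a real deployment would likely replace with a proper health check.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// One polling pass over the removed validators. For every validator that
/// `is_online` reports as reachable, `re_add` is called to submit the
/// transaction that adds it back. Returns the validators to keep watching:
/// those still unreachable, or whose transaction failed.
pub fn poll_once<P, S>(removed: Vec<String>, mut is_online: P, mut re_add: S) -> Vec<String>
where
    P: FnMut(&str) -> bool,
    S: FnMut(&str) -> Result<(), String>,
{
    let mut still_waiting = Vec::new();
    for validator in removed {
        if is_online(&validator) && re_add(&validator).is_ok() {
            continue; // added back, stop watching it
        }
        still_waiting.push(validator);
    }
    still_waiting
}

/// A possible liveness probe: try to open a TCP connection to the node's
/// address (for example "10.0.0.5:30333") within `timeout`.
pub fn tcp_probe(addr: &str, timeout: Duration) -> bool {
    match addr.parse::<SocketAddr>() {
        Ok(socket) => TcpStream::connect_timeout(&socket, timeout).is_ok(),
        Err(_) => false, // not a valid socket address
    }
}
```

A service would call `poll_once` on a timer, feeding the returned list into the next pass, until the list is empty.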

Show me the code!
