Nodes form a peer-to-peer network, constantly communicating with each other to reach consensus about the state of the Blockchain. A node is not necessarily a single physical machine, but it appears as a single logical entity to its peers by having a unique ID and an address where it responds to requests.
Nodes periodically try to discover each other based on elements of the Kademlia protocol. Unlike the original Kademlia, which used UDP, nodes use point-to-point gRPC calls for communication. The specifics can be found under Kademlia API. According to this protocol every
Node has the following properties:
id is a Keccak-256 digest of the Public Key from the SSL certificate of the node
host is the public endpoint where the node is reachable
discovery_port is where the gRPC service implementing the
KademliaService is listening
protocol_port is where the gRPC service implementing the consensus related functionality is listening
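For illustration, here is a minimal sketch of such a Node record in Python; the keccak helper comes from the pycryptodome package, and the exact encoding of the public key is an assumption, not part of this spec:

from dataclasses import dataclass
from Crypto.Hash import keccak  # pycryptodome, assumed here for Keccak-256

@dataclass(frozen=True)
class Node:
    id: bytes            # Keccak-256 digest of the node's SSL public key
    host: str            # public endpoint where the node is reachable
    discovery_port: int  # port of the gRPC KademliaService
    protocol_port: int   # port of the consensus related gRPC service

def node_id(public_key: bytes) -> bytes:
    # id is derived from the Public Key found in the node's SSL certificate
    return keccak.new(digest_bits=256, data=public_key).digest()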
KademliaService itself has to implement only two methods:
Ping is used by the sender to check if the callee is still alive. It also gives the callee the chance to update its list of peers and remember that it has seen the sender.
Lookup asks the node to return a list of Nodes that are closest to the id the sender is looking for, where the distance between two Nodes is based on the longest common prefix of bits in their ids.
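This distance metric can be sketched as follows in Python (the function names are illustrative, not part of the API):

def common_prefix_length(a: bytes, b: bytes) -> int:
    # The XOR of two IDs has exactly as many leading zero bits as the
    # length of their longest common prefix.
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return len(a) * 8 - x.bit_length()

def bucket_index(self_id: bytes, peer_id: bytes) -> int:
    # Peers sharing a longer prefix with our id are "closer"; each bucket
    # of the Kademlia table corresponds to one possible prefix length.
    return common_prefix_length(self_id, peer_id)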
At startup the nodes should be configured with the address of a well-known peer to bootstrap themselves from. To discover other nodes they can pick from multiple strategies:
Perform a one-time lookup of their own id via the bootstrap node (which doesn't know them yet) to receive a list of peers closest to themselves, then recursively perform the same lookup with those peers to accumulate more and more addresses until there is nothing new to add (see the sketch after this list).
Periodically construct artificial keys to try to find peers at certain distances from their own id and perform a lookup via a random node.
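A sketch of the first strategy in Python, assuming a lookup(node, id) client stub that wraps the gRPC Lookup call:

def discover_peers(self_id, bootstrap_node, lookup):
    # Starting from the bootstrap node, keep asking newly learned peers
    # for the nodes closest to our own id until nothing new turns up.
    peers = {bootstrap_node.id: bootstrap_node}
    frontier = [bootstrap_node]
    while frontier:
        target = frontier.pop()
        for peer in lookup(target, self_id):
            if peer.id != self_id and peer.id not in peers:
                peers[peer.id] = peer
                frontier.append(peer)
    return list(peers.values())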
Nodes propose Blocks in parallel by finding Deploys that can be applied independently of each other. Whenever a new Block is formed, it has to propagate through the network to become part of the consensus. This is achieved by nodes making gRPC calls to each other to invoke methods on their GossipService interface, which should be listening on the protocol_port of the Node record that represents the peer in the network. The details of the service can be seen under Gossiping API.
We call the method by which information is disseminated on the network gossiping. Gossiping means that when a node comes across a new piece of information it relays it to a selection of its peers, who do the same, eventually saturating the network, i.e. reaching the point where everyone has been notified.
Nodes need three layers of information about Blocks to be able to participate in the consensus:
Block meta-data, e.g. parent relationships, validator weights, state hashes.
Deploys that were included in a given Block.
Global State, to be able to run the Deploys, validate Blocks and build new ones on top of them.
Out of these only the first two are gossiped between nodes; the Global State each node has to calculate itself by running the Deploys.
We have the following requirements for our gossiping approach:
It should be efficient, i.e. minimise the network traffic while maximising the rate at which we reach full saturation.
Node operators should have a reasonable expectation that network traffic (a finite resource) will scale linearly with the number of Deploys across the network while being less affected by the total number of nodes. This means the load should be distributed among the peers rather than creating hotspots.
To achieve these we have the following high level approach:
Gossip only the meta-data about the Blocks to minimise the amount of data transfer.
Full Blocks can be served on demand when the gossiped meta-data is new.
Nodes should pick a relay factor according to how much network traffic they can handle and find that many nodes to gossip to for which the information is new.
Nodes should pick a relay saturation target beyond which they don't try to push to new peers, so the last ones to get a message don't have to contact every other peer in futility.
Nodes should try to spread the information mostly to their closer neighbours (in terms of Kademlia distance) but also to their farther away peers to accelerate the spread of information to the far reaches of the network.
As we established in the Node Discovery section, the nodes maintain a list of peers using a Kademlia table. The table has one bucket for each possible distance between the bits of their IDs, and they pick a number k to be the length of the list of nodes that they keep track of in each of these buckets.
In statistical terms, half of the nodes in the network will fall into the first bucket, since roughly half of all IDs will differ from ours in their first bit. Similarly, each subsequent bucket holds half of the remainder of the network. Since k is the same for each bucket, this means that nodes can track many more of their closest neighbours than of the ones far away from them. This is why in Kademlia, as we perform lookups, we get closer and closer to the best possible match anyone knows about.
In practice this means that if we have a network of 50 nodes and a k of 10, then from the perspective of any given peer, 25 nodes fall into the first bucket, but it only tracks 10 of them, ergo it will never gossip to the 15 nodes from the other half that it doesn't have room to track in its table.
We can use this skewness to our advantage: say we pick a relay factor of 3; if we make sure to always notify 1 node in the farthest non-empty bucket and 2 in the closer neighbourhood, we can increase the chance that the information gets to those Nodes on the other half of the board that we don't track.
The following diagram illustrates this. The black actor on the left represents our node and the vertical partitions represent the distance from it. The split in the middle means the right half of the board falls under a single bucket in the Kademlia table. The black dots on the board are the nodes we track, the grey ones are those we don't. We know few peers from the right half of the network, but many more on the left, because it's covered by finer and finer grained buckets. If we pick nodes evenly across the full distance spectrum to gossip to, and the Node on the right follows the same rule, it will start distributing our message on that side of the board with a slightly higher chance than bouncing it back to the left.
In terms of probabilities of reaching a grey node in the 2nd round, or just the message being on the right side of the board in the 1st or 2nd round of message passing, the gains are marginal and depend on how many peers there are in the Kademlia table. We could give higher weights to the buckets that reach the untracked parts of the network, but the effects will have to be simulated.
In practice the messages don't have to reach every node on the network. Achieving 100% saturation would be impractical as it would require a high level of redundancy, i.e. a node receiving the same message multiple times from different peers. Tracking who saw a message could bloat the message size or open it up to tampering. But even if a node isn't notified about a particular Block right now, it has equal chances of receiving the next Block that builds on top of that, at which point it can catch up with the missing chain.
Therefore nodes should have a relay saturation value beyond which they don't try to gossip a message to new nodes. For example, if we pick a relay factor of 5 and a relay saturation of 80%, then it's enough to try to send to at most 25 nodes (5 / (1 - 0.8)). If we find fewer than 5 peers among them to whom the information was new, then we have achieved a saturation beyond 80%. This prevents the situation in which the last node to get the message has to contact every other node in a futile attempt to spread it 5 more times. Assuming that every node tracks a random subset of peers in the network, the saturation we observe in the nodes we try to contact is an approximation of the saturation in the whole network, with the accuracy depending on how many nodes we tried.
algorithm BlockGossip is
    input: message M to send,
           relay factor rf,
           relay saturation rs,
           kademlia table K
    output: number of messages sent s

    s <- 0
    P <- flatten the peers in K, ordered by distance from current node
    G <- partition P into rf equal sized groups
    n <- an empty set to track notified peers
    m <- rf / (1 - rs) // the maximum number of peers to try to send to
    i <- 0 // the index of the current group

    while i < sizeof(G) and s < rf and sizeof(n) < m:
        p <- a random peer in G(i) - n, or None if empty
        if p is None then
            i <- i + 1
        else
            n <- n + p
            r <- the result of sending M to p, indicating whether M was new to p
            if r is true then
                i <- i + 1
                s <- s + 1
    return s
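For illustration, the same loop as a small runnable Python sketch; send(peer, message) is an assumed stub returning whether the message was new to the peer:

import random

def block_gossip(message, relay_factor, relay_saturation, peers, send):
    # peers: the flattened Kademlia table, ordered by distance from us.
    # Partition it into roughly relay_factor equal sized groups so that
    # each successful relay lands in a different region of the network.
    group_size = max(1, len(peers) // relay_factor)
    groups = [peers[i:i + group_size] for i in range(0, len(peers), group_size)]
    notified = set()
    max_tries = relay_factor / (1 - relay_saturation)  # e.g. 5 / 0.2 = 25
    sent = 0
    i = 0
    while i < len(groups) and sent < relay_factor and len(notified) < max_tries:
        candidates = [p for p in groups[i] if p not in notified]
        if not candidates:
            i += 1  # this group is exhausted, move on to the next
        else:
            p = random.choice(candidates)
            notified.add(p)
            if send(p, message):  # True means the message was new to p
                i += 1            # one successful relay per distance group
                sent += 1
    return sent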
We can rightfully ask how the gossip algorithm outlined above fares in the face of malicious actors that don't want to take their share in the data distribution, i.e. what happens if a node decides not to propagate the messages?
The consensus protocol has a built-in protection against lazy validators: to get their fees from a Block produced by somebody else, they have to build on top of it. And when they do, they have to gossip about their own Block, otherwise it will not become part of the DAG, or it can get orphaned if conflicting blocks emerge; so it's in everyone's interest for gossiping to happen at a steady pace.
What if they decide to announce their own Blocks to everyone but never relay Blocks from other nodes? They have a few incentives against doing this:
If everybody did this, then the nodes unknown to the creator would get Blocks much later and might produce conflicting blocks, and the consensus would slow down.
When they finally announce a Block they built, everyone would try to download it directly from them, putting extra load on their network, and the peers might also have to download from them the earlier Blocks that the node failed to relay.
If we have to relay to 100 nodes directly, it could easily take longer for each of the 100 to download it from 1 node than for 10 nodes to do so and then relay to 10 more nodes each.
Having a relay factor together with the mechanism of returning whether the information was new serves the following purposes:
By indicating that the information was new, the callee is signalling to the caller that once it has validated the Block it will relay the information; therefore the caller can be content that by informing this node it carried out the number of gossips it set out to do, i.e. it will have to serve the full Blocks up to rf times.
By indicating that the information wasn't new, the callee is signalling that it will not relay the information any further; therefore the caller should pick another node if it wants to live up to its pledge of relaying to rf new nodes.
Nodes expect the ones which indicated that the Block meta-data was new to them to later attempt to download the full Blocks. This may not happen, as other nodes may notify them too, in which case they can download some Blocks from here, some from there.
There are two forms of lying that can happen here:
The callee can say the information wasn't anything new, but then attempt to download the data anyway. Nodes may disincentivise this by tracking each other's reputation and blocking nodes that lied to them.
The callee can say the information was new but not relay it. This goes against their own interest as well, but it's difficult to detect. A higher relay factor can compensate for the number of liars on the network.
Nodes may also use reputation tracking and blocking if they receive notifications about Blocks which cannot be validated or which the notifier isn't able to serve when asked.
Let's take a closer look at how the methods supported by the GossipService can be used to spread the information about Blocks between nodes.
When a node creates one or more new Blocks, it should pick a number of peers according to its relay factor and call NewBlocks on them, passing them the new block_hashes. The peers check whether they already have the corresponding blocks: if not, they indicate that the information is new and schedule a download from the sender; otherwise the caller looks for other peers to notify.
By only sending block hashes we can keep the message size to a minimum. Even block headers need to contain a lot of information for nodes to be able to do basic verification; there's no need to send all that if the receiving end already knows about it.
When a node receives a NewBlocks request about hashes it didn't know about, it must synchronise its Block DAG with the sender. One way to do this is to have some kind of download manager running in the node which:
maintains a partially connected DAG of BlockSummary records that it has seen
tries to connect the new bits to the existing ones by downloading them from the senders
tracks which nodes notified it about each Block, to know alternative sources to download from
tracks which Blocks it promised to relay to other nodes
downloads and verifies full Blocks
notifies peers about validated Blocks if it promised to do so
StreamAncestorBlockSummaries is a high level method that the caller node can use to ask another node for a way to get to the Block that node just announced. It traverses from the target block backwards along its parents until every ancestor path can be connected to the DAG of the caller. It has the following parameters:
target_block_hashes is typically the hashes of the new Blocks the node was notified about, but if multiple iterations are needed to find the connection points then they can be further back in the DAG.
known_block_hashes can be supplied by the caller to provide early exit criteria for the traversal. These can for example include the hashes close to the tip of the caller's DAG, forks, last Blocks seen from validators, and approved Blocks (i.e. Blocks with a high safety threshold).
max_depth can be supplied by the caller to limit traversal in case the known_block_hashes don't stop it earlier. This can be useful during iterations when we have to go back beyond the caller's approved blocks, in which case it might be difficult to pick known hashes.
The result should be a partial traversal of the DAG in reverse BFS order, returning a stream of BlockSummaries that the caller can partially verify, merge into its DAG of pending Blocks, and then recursively call StreamAncestorBlockSummaries on any Block that didn't connect with a known part of the DAG. Ultimately all paths lead back to the Genesis or the last checkpointed Block, so eventually we should find the connection, or the caller can decide to give up pursuing a potentially false lead from a malicious actor.
The following diagram illustrates the algorithm. The dots in the graph represent the Blocks; the ones with a thicker outer ring are the ones passed as target_block_hashes. The dashed rectangles are what's being returned in a stream by each invocation of StreamAncestorBlockSummaries.
Initially we only know about the black Blocks, which form our DAG.
We are notified about the white Block, which is not yet part of our DAG.
We call StreamAncestorBlockSummaries passing the white Block's hash as target and a max_depth of 2 (passing some of our known block hashes as well).
We get a stream of two BlockSummary records in reverse order from the 1st call and add them to our DAG. But we can see that the grandparents of the white Block are still not known to us.
We call StreamAncestorBlockSummaries a 2nd time, passing the grandparents' hashes as targets.
From the 2nd stream we can see that at least one of the Blocks is connected to the tip of our DAG, but there are again Blocks with missing dependencies.
We call StreamAncestorBlockSummaries a 3rd time, and now we can form a full connection with the known parts of the DAG; there are no more Blocks with missing parents.
The following algorithm describes the server's role:
algorithm StreamAncestorBlockSummaries is
    input: target block hashes T,
           known block hashes K,
           maximum depth m,
           block summary map B
    output: stream of ancestry DAG in child to parent order

    G <- an empty DAG of block hashes
    Q <- an empty priority queue of (depth, hash) pairs
    A <- an empty list of ancestor block summaries
    V <- an empty set of visited block hashes

    for each hash h in T do
        push (0, h) to Q

    while Q is not empty do
        pop (d, h) from Q
        if h in V then
            continue
        if h is not in B then
            continue
        b <- B(h)
        A <- A + b
        V <- V + h
        for each parent hash p of b do
            G <- G + (h, p)
            if d < m and p not in K then
                push (d+1, p) to Q

    return stream of A
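For illustration, the same traversal as a Python generator; the summaries map and the parent_hashes attribute are assumptions standing in for the real BlockSummary type:

import heapq

def stream_ancestor_block_summaries(targets, known, max_depth, summaries):
    # summaries: a dict from block hash to BlockSummary; each summary is
    # assumed to expose the hashes of its parents as .parent_hashes.
    queue = [(0, h) for h in targets]  # (depth, hash) pairs
    heapq.heapify(queue)               # shallower entries are served first
    visited = set()
    while queue:
        depth, h = heapq.heappop(queue)
        if h in visited or h not in summaries:
            continue
        visited.add(h)
        summary = summaries[h]
        yield summary  # emitted in child to parent (reverse BFS) order
        for parent in summary.parent_hashes:
            if depth < max_depth and parent not in known:
                heapq.heappush(queue, (depth + 1, parent))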
The next algorithm depicts syncing DAGs from the client's perspective:
algorithm SyncDAG is
    input: sender node s,
           new block hashes N,
           block summary map B
    output: new block summaries sorted in topological order from parent to child

    G <- an empty DAG of block hashes
    A <- an empty map of block summaries
    m <- a suitable maximum depth, say 100

    function Traverse is
        input: block hashes H
        output: number of summaries traversed

        S <- s.StreamAncestorBlockSummaries(H, m)
        for each block summary b in S do
            if b cannot be validated then
                return 0
            if b is not connected to H with a path or the path is longer than m then
                return 0
            if we see an abnormally shallow and wide DAG being built then
                // the server is trying to feed an exponential amount of data
                // by branching wide while staying within the maximum depth
                return 0
            h <- the block hash of b
            A(h) <- b
            for each parent p of b do
                if p is not in B then
                    G <- G + (p, h)
        return sizeof(S)

    Traverse(N)

    define "hashes in G having missing dependencies" as
        hash h having no parent in G and h is not in B

    while there are hashes in G with missing dependencies do
        H <- the hashes in G with missing dependencies
        if Traverse(H) equals 0 then
            break

    if there are hashes in G with missing dependencies then
        return empty because the DAG did not connect
    else
        return A(h) for hashes h in G sorted in topological order from parent to child
SyncDAG needs to have some protection against malicious actors trying to lead it down the garden path by feeding it infinite streams of data.
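As one possible form of the "abnormally shallow and wide" check, sketched in Python, the client could track how many summaries it accepted at each traversal depth and cut off streams that fan out too wide; the max_width threshold is an assumed tuning parameter, not part of the spec:

def looks_abnormally_wide(count_at_depth, max_width=50):
    # count_at_depth: dict mapping traversal depth to how many distinct
    # summaries we accepted at that depth. An honest ancestry is roughly
    # chain-like; a stream whose width explodes while staying shallow
    # suggests the server is padding the response with junk branches.
    return any(width > max_width for width in count_at_depth.values())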
Once SyncDAG indicates that the summaries from the NewBlocks notification have a common ancestry with the DAG we have, we can schedule the download of data.
Q <- an empty queue of blocks to sync
G <- an empty DAG of block dependencies in lieu of a download queue
S <- a map of source information we keep about blocks
GBS <- the global block summary map
GFB <- the global full block map

function NewBlocks is
    input: sender node s,
           new block hashes N
    output: flag indicating if the information was new

    H <- find hashes h in N where h is not in GBS
    if H is not empty then
        push (s, H) to Q
        return true
    return false

parallel threads Synchronizer is
    for each (s, N) message m in Q do
        D <- SyncDAG(s, N, GBS)
        for each block summary b in D do
            r <- if hash of b is in N then true else false
            ScheduleDownload(b, s, r)

function ScheduleDownload is
    input: block summary b,
           sender node s,
           relay flag r

    h <- the hash of b
    if h is in GBS then
        return
    if h is in S then
        S(h) <- S(h) with extra source s
        if r is true then
            S(h) <- S(h) with relay true
    else
        N <- the list of potential source nodes for b with single element s
        S(h) <- (b, N, r)
        if any parent p of b is in pending downloads G then
            add h as a dependant of p in G
        else
            add h to G without dependencies

parallel threads Downloader is
    rf <- the relay factor from config, say 5
    rs <- the relay saturation from config, say 0.8

    for each new item added to G or after an idle timeout do
        while we can mark a new hash h in G without dependencies as being downloaded do
            (b, N, r) <- S(h)
            n <- a random node in N
            f <- the full block returned by n.GetBlockChunked(h)
            if f is valid then
                GFB(h) <- f
                GBS(h) <- b
                remove h from G
                if r is true then
                    K <- the current Kademlia table of peers
                    s <- the current node
                    BlockGossip(NewBlocks(s, h), rf, rs, K)
Full Blocks containing all the Deploys can get big, and gRPC implementations enforce limits on the maximum message size they will transmit in a single request, therefore we need to break the payload up into smaller chunks, transfer them as a stream and reconstruct them on the other side.
The caller should keep track of the amount of data it receives and compare it to the content_length it got initially in the header to make sure it's not being fed an infinite stream of data.
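The reassembly on the caller's side could look like this sketch, assuming a Chunk message whose first instance carries a header with the declared content_length and whose subsequent instances carry raw data:

def assemble_block(chunks):
    # chunks: an iterator of Chunk messages streamed by GetBlockChunked.
    it = iter(chunks)
    expected = next(it).header.content_length
    parts, received = [], 0
    for chunk in it:
        received += len(chunk.data)
        if received > expected:
            # more bytes than announced: bail out rather than buffer forever
            raise ValueError("chunk stream exceeds declared content_length")
        parts.append(chunk.data)
    if received < expected:
        raise ValueError("chunk stream ended before content_length was reached")
    return b"".join(parts)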
In theory the method could return a stream of multiple blocks, but asking for them one by one from multiple peers, as notifications about alternative sources arrive at the node, should be favoured over downloading everything from a single source anyway.
When a new node joins the network it should ask one or more of its peers about the tips of their DAGs, i.e. the Blocks on which they themselves would be proposing new blocks. This can be followed by an arbitrary number of calls to StreamAncestorBlockSummaries until the new node has downloaded and partially verified the full DAG using the algorithms outlined above. It's worth cross-correlating the tips of multiple nodes to avoid being lied to by any single malicious actor.
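A sketch of that cross-check in Python, where get_tips(peer) is an assumed stub returning the block hashes a peer would currently build on:

from collections import Counter

def trusted_tips(peers, get_tips, quorum=2):
    # Keep only tips reported by at least `quorum` peers, so a single
    # malicious actor can't make us chase a fabricated branch.
    counts = Counter(h for peer in peers for h in set(get_tips(peer)))
    return [h for h, c in counts.items() if c >= quorum]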
Finally, the following sequence diagram demonstrates the life cycle of Block propagation among nodes. The dashed blocks have been left unconnected for brevity, but they do the same thing as the ones on the left side.