How to estimate total number of nodes in a P2P network?

Question

I'm creating a p2p network without master node. How to count the total live nodes? Precision is not the key but lightweight and performance is important.

Currently I'm thinking two approaches, both of them have big drawback...

Each node has a random number as its ID. Each node has a huge bit array, initially only the bit at index == ID is 1 and others 0. Each node exchange it's bit array with known peers. The bit array of a node is merged with peers using bit OR. As this process goes, eventually every node should have similar bit array, and the number of 1 indicates number of nodes in the network. The advantage is that this can be done in parallel and along the time. At query time the response can be very quick (because the result is already there). Disadvantages: a). Hard to handle nodes that are gone. b). The bit array is too big if there are millions of nodes.
Something like BFS. From an initial node, it asks all known peers how many nodes are connected. Then sum all responses as the total size of the to-be-probed network. This is a recursion approach. If a node already received such a query, it will ignore futher same query from other nodes. Just like BFS. The disadvantage is that this has a high probability to be inaccurate. Looking at the BFS tree, for every node it can be single point failure, which results all query from that node missing. Consider initializing query from node A, querying connected peers B & C. B & C then spread the query to the network they connect. Due to some reason C fails and can not response to A. Then all nodes received direct or indirect query from C will not be counted. This issue can happen at any node of the network and could lead to minimal inaccuracy, or big inaccuracy, as in the aforementioned approach, probably 50% of the network.

Any idea how to practically estimate total nodes in P2P network?

Sorin · Accepted Answer

There's no perfect approach. If there's a node on a critical part that is flaky (goes down and up often) then there's a large chance that you miss part of the network because of that and no telling how big that part is. Even assuming a well connected network it's hard to give a correct count since the state of the network may change while the algorithm is running.

You probably don't want your second approach because you put more weight on the nodes close to the starting point, just like you said.

However here's a different take on the first approach that does away with the bit array being too long at the cost of some accuracy.

The idea is very similar but instead of having to count all your nodes (your ids must be unique and fairly consecutive) and a very large bit array, you can have random ids (each machine picks a random 32/64 bit id) and instead of the bit array you pass along a bloom filter and a counter. The bloom filter is similar to the array in that it can say if the id has not yet been inserted. However it's much more space efficient. The downside is that while it has 100% recall (if the filter says the id is not there, it really is not there) it doesn't have 100% precision (the filter may say that the id has been inserted but in reality it wasn't). For the algorithm this means that you are sure to avoid double counting (good) but depending on the order of insertions you might miss some nodes.

The rest of the algorithm is the same. Start by inserting your own id and set the counter to 1. Send the state to all peers. When you receive a state, check if your id has been already inserted and if not insert it, add 1 and send the state to all your peers. If you have the same bloom filter state with different counts, pick the larger one (since we're sure there's no double counting). Eventually the states will converge and that's your answer.

Here's a link on how a bloom filter works.

How to estimate total number of nodes in a P2P network?

Answers (2)

Related Questions