Nan Wang
Nan Wang

Reputation: 65

How to estimate total number of nodes in a P2P network?

I'm creating a p2p network without master node. How to count the total live nodes? Precision is not the key but lightweight and performance is important.

Currently I'm thinking two approaches, both of them have big drawback...

  1. Each node has a random number as its ID. Each node has a huge bit array, initially only the bit at index == ID is 1 and others 0. Each node exchange it's bit array with known peers. The bit array of a node is merged with peers using bit OR. As this process goes, eventually every node should have similar bit array, and the number of 1 indicates number of nodes in the network. The advantage is that this can be done in parallel and along the time. At query time the response can be very quick (because the result is already there). Disadvantages: a). Hard to handle nodes that are gone. b). The bit array is too big if there are millions of nodes.

  2. Something like BFS. From an initial node, it asks all known peers how many nodes are connected. Then sum all responses as the total size of the to-be-probed network. This is a recursion approach. If a node already received such a query, it will ignore futher same query from other nodes. Just like BFS. The disadvantage is that this has a high probability to be inaccurate. Looking at the BFS tree, for every node it can be single point failure, which results all query from that node missing. Consider initializing query from node A, querying connected peers B & C. B & C then spread the query to the network they connect. Due to some reason C fails and can not response to A. Then all nodes received direct or indirect query from C will not be counted. This issue can happen at any node of the network and could lead to minimal inaccuracy, or big inaccuracy, as in the aforementioned approach, probably 50% of the network.

Any idea how to practically estimate total nodes in P2P network?

Upvotes: 2

Views: 1237

Answers (2)

user555045
user555045

Reputation: 64913

If you have a DHT or similar overlay network (which you probably do, how else are you going to have a purely distributed network?), you can estimate its size by looking at the distribution of IDs in a small sub-space of the ID space. Your approaches essentially try to be accurate, so they end up being impractical. Looking at only a part of the ID space and drawing conclusions about the rest of it is clearly not accurate in general, it only works if IDs are random and if you know about enough nodes that you can invoke the law of large numbers. If you keep appropriate routing tables you can make this estimate purely locally, but if your queries allow it you could decide to ask for nodes close to some randomly generated ID and now you have two estimates that you can average (but that's an expensive trick). There are many papers about making this more resilient to certain attacks, more accurate in case of high churn, and with actual formulas for the estimate.

Upvotes: 2

Sorin
Sorin

Reputation: 11968

There's no perfect approach. If there's a node on a critical part that is flaky (goes down and up often) then there's a large chance that you miss part of the network because of that and no telling how big that part is. Even assuming a well connected network it's hard to give a correct count since the state of the network may change while the algorithm is running.

You probably don't want your second approach because you put more weight on the nodes close to the starting point, just like you said.

However here's a different take on the first approach that does away with the bit array being too long at the cost of some accuracy.

The idea is very similar but instead of having to count all your nodes (your ids must be unique and fairly consecutive) and a very large bit array, you can have random ids (each machine picks a random 32/64 bit id) and instead of the bit array you pass along a bloom filter and a counter. The bloom filter is similar to the array in that it can say if the id has not yet been inserted. However it's much more space efficient. The downside is that while it has 100% recall (if the filter says the id is not there, it really is not there) it doesn't have 100% precision (the filter may say that the id has been inserted but in reality it wasn't). For the algorithm this means that you are sure to avoid double counting (good) but depending on the order of insertions you might miss some nodes.

The rest of the algorithm is the same. Start by inserting your own id and set the counter to 1. Send the state to all peers. When you receive a state, check if your id has been already inserted and if not insert it, add 1 and send the state to all your peers. If you have the same bloom filter state with different counts, pick the larger one (since we're sure there's no double counting). Eventually the states will converge and that's your answer.

Here's a link on how a bloom filter works.

Upvotes: 3

Related Questions