Reputation: 87
I'm now learning Kademlia network by reading the classical paper Kademlia: A Peer-to-peer Information System Based on the XOR Metric. I want to understand the complexity of its operation but still cannot figure it out.
In the 3 Sketch of proof section, the paper gives two definitions:
And three conclusions:
So my questions are:
Upvotes: 4
Views: 1321
Reputation: 1543
The accepted answer is great!
Consider things from a particular node u's perspective.
For your k closest nodes, you have complete information of the neighbourhood. This is because you are storing items in buckets of size k and there are not enough keys to go in to those buckets so essentially the lower leaves will always have empty k buckets. So you will know all your k closest neighbours.
Now how high is the kth closest neighbour in the tree. there is 1 key 1 bit away(final bit differing) there are 2 keys 2 bits away(Last two bits differing) there are 4 keys 3 bits away(Last 3 bits differing) so the height n of the kth closest node is
1 + 2 + 4 + ... + 2^n = k
=> 2^n = k + 1
=> n = log(k+1)
So the kth distant node is at height log(k)
What this tells us is that when you get a search for a node whose distance is <= logk(height of the kth node).We can answer immediately since we know the complete neighbourhood and we don't need to spend logk steps obtaining 1 bit of information in each step as we need to do when the requested node is farther away.
So when we do a search for a node whose depth is h. You will query nodes whose depth decreases by 1 in the worst case till you reach a node for which the requested node's depth is log k and that node can immediately answer the query.
2. To mathematically answer your first question, with overwhelming probability the nodes height is O(log n)
Consider a network where the keys are M bits and there are N nodes in the network. Now we are looking at the routing tree from a particular node u's perspective. This tree will be crowded towards the higher order bits since 1/2 the possible keys fall in top bucket, 1/4 in second and so on.
So what is the probability that your first q slots
in the routing tree with distance
from 2^0 - 2^1 to 2^q-1 - 2^q are empty.
This requires that all the N nodes fall in the buckets greater than q
To select a key in bucket greater than q you
need to ensure that your maximum prefix match is less than M-q.
So there are 2^M total keys of which 2^q keys
have the same prefix of length (M-q) as the node u.
So the favourable cases are 2^M - 2^q.
Total cases are 2^M
Assume all N key draws are independent
So the probability that q lowest slots are empty is (1 - 1/2^(M-q))^N
So we plug in q = M - clog(n) which would mean
that there are clog(n) filled buckets
with M-clog(N) lower buckets empty
P = (1-1/2^(clog(N)))^N
= (1-1/N^c)^N
this is approximately equal to
1-1/N^(c-1)
And so the probability goes to 1 as c increases and we are very likely to have only clog(n) top slots filled in the routing table.
Upvotes: 0
Reputation: 43125
It has been a while since I've actually read the paper, so I'm mostly piecing this together from my implementation experience instead of trying to match the concepts that I have in my head to the formal definitions in the paper, so take the following with a little grain of salt
What is "least significant empty bucket" and "most significant k-buckets"?
That basically refers to the buckets sorted by XOR distance relative to the node's own ID
How to explain the depth and bucket height in visual way?
Each bucket covers a keyspace region. E.g. from 0x0000simplified to 2 bytes to 0x0FFF for 1/16th of the keyspace. This can be expressed in CIDR-like masks, 0x0/4 (4 prefix bits). That's more or less the depth of a bucket.
There are several ways to organize a routing table. The "correct" way is to represent it as tree or sorted list based on the lowest ID represented by a bucket. This approach allows for arbitrary bucket split operations as some routing table optimizations call for and can also be used to implement node multihoming.
A simplified implementation may instead use a fixed-length array and put each bucket at the position of shared prefix bits relative to the node's own ID. I.e. position 0 in the array will have 0 shared prefix bits, it's the most-distant bucket, the bucket covering 50% of the keyspace and the bucket where the most significant bit is the inverted MSB of the node's own ID.
In that case the depth is simply the array position.
How to understand the second and third conclusions, say, why log k and h - log k?
Say you are looking for an ID that is the furthest away from your own node's ID. Then you will only have one bucket covering that part of the keyspace, namely it will cover half the keyspace with the most significant bit differing from yours. So you ask one (or several) nodes from that bucket. By virtue of their ID bits having the first bit in common with your lookup target they are more or less guaranteed to have split that in two or more, i.e. have at least double the keyspace coverage for the target space. So they can provide at least 1 bit better information.
As you query closer nodes to the target they will also have better keyspace coverage near the target region because that's also closer to their own node ID.
Rinse, repeat until there are no closer nodes to be found.
Since each hop shaves off at least 1 bit of distance at a time you basically need a O(log(n)) hop count where n is the network size. Since network size basically dictates the average distance between nodes and thus bucket depth needed for your home bucket.
If the target key is closer to your own ID you will need fewer steps since you will have better coverage of that region of the keyspace.
Since k is a constant (the nodes-per-bucket) so is log k. Double the number of nodes in a bucket and it'll have twice the resolution of the given keyspace region and thus will (probabilistically) yield a node that is one bit closer to the target than a bucket with k/2 size. I.e. you have to double the number of entries per bucket for each additional bit per hop you wish to save.
Edit: Here's a visualization of an actual single-homed bittorrent DHT routing table, sorted by their prefixes, i.e. not relative to the local node ID:
Node ID: 2A631C8E 7655EF7B C6E99D8A 3BF810E2 1428BFD4
buckets: 23 / entries: 173
000... entries:8 replacements:8
00100... entries:8 replacements:0
0010100... entries:8 replacements:2
0010101000... entries:8 replacements:4
00101010010... entries:8 replacements:7
001010100110000... entries:8 replacements:3
0010101001100010... entries:8 replacements:3
00101010011000110000... entries:8 replacements:1
001010100110001100010... entries:3 replacements:0
0010101001100011000110... entries:6 replacements:0
0010101001100011000111... entries:6 replacements:0
0010101001100011001... entries:8 replacements:2
001010100110001101... entries:8 replacements:1
00101010011000111... entries:8 replacements:2
00101010011001... entries:7 replacements:0
0010101001101... entries:8 replacements:0
001010100111... entries:8 replacements:0
001010101... entries:8 replacements:1
00101011... entries:7 replacements:0
001011... entries:8 replacements:0
0011... entries:8 replacements:8
01... entries:8 replacements:8
1... entries:8 replacements:8
Upvotes: 4