Reputation: 3595
A commonly cited benefit of B-trees is that the degree of branching can be high, which is useful in limiting the number of disk accesses required to reach a node.
However, suppose we have a (k, 2k) B-tree and naively implement the nodes, scanning the keys within each node linearly. The tree has height O(log(n) / log(k)) and each node visited costs O(k) to scan, so search is actually going to be in
O( k * log(n) / log(k) )
One might instead opt to represent the keys inside each node as a nested, balanced tree, so that insertion and deletion of keys within a node stay in O(log(k)) and search remains in O(log(n)) even for very large k.
Are there papers suggesting this approach or implementations following it, or is the branching factor k generally too low to make it worth the hassle?
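For concreteness, here's a minimal in-memory sketch of the two node representations I have in mind (std::set just stands in for the nested balanced tree; a real implementation would of course keep the nodes in disk pages):

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

// "Naive" node: keys in a sorted array, searched with a linear scan.
// Search within the node is O(k); inserting shifts up to 2k entries, also O(k).
struct ArrayNode {
    std::vector<int64_t> keys;  // at most 2k keys, kept sorted

    bool contains(int64_t key) const {
        return std::find(keys.begin(), keys.end(), key) != keys.end();
    }
    void insertKey(int64_t key) {
        keys.insert(std::upper_bound(keys.begin(), keys.end(), key), key);
    }
};

// Proposed node: keys kept in a nested balanced tree (std::set is typically a
// red-black tree), so search, insert, and delete within the node are O(log k).
struct TreeNode {
    std::set<int64_t> keys;

    bool contains(int64_t key) const { return keys.count(key) != 0; }
    void insertKey(int64_t key) { keys.insert(key); }
};

int main() {
    ArrayNode a;
    TreeNode  t;
    const int64_t vals[] = {5, 1, 9};
    for (int64_t x : vals) { a.insertKey(x); t.insertKey(x); }
    return (a.contains(9) && t.contains(9)) ? 0 : 1;
}
```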
Upvotes: 1
Views: 216
Reputation: 372764
This is an interesting idea, but typically you don’t see this done. There are a couple of reasons why.
For starters, when you’re doing an insertion or deletion on a B-tree, the main bottleneck is the number of I/O transfers performed rather than the amount of CPU work. In that sense, the cost of shifting over a bunch of array elements is likely not going to be all that large a fraction of the cost of the B-tree operation.
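To get a feel for the scale of that difference, here's a tiny sketch that times the worst-case shift inside one full node and compares it against an assumed ~100 µs random page read (that latency figure is just an illustrative assumption, not a measurement of any particular device):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <ratio>
#include <vector>

int main() {
    const std::size_t k = 1024;            // half the maximum node size
    std::vector<int64_t> node(2 * k, 0);   // a full (k, 2k) node
    node.reserve(2 * k + 1);               // so the insert below only shifts, no realloc

    auto start = std::chrono::steady_clock::now();
    node.insert(node.begin(), 42);         // worst case: shift all 2k keys over
    auto end = std::chrono::steady_clock::now();

    double shift_us =
        std::chrono::duration<double, std::micro>(end - start).count();
    double assumed_io_us = 100.0;          // assumed random page read, SSD-ish

    std::printf("shift of %zu keys:        %.2f us\n", 2 * k, shift_us);
    std::printf("assumed single page read: %.2f us\n", assumed_io_us);
}
```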
Next, there’s the question of the frequency of these operations. If you use a B+-tree, in which the keys are stored purely in the leaves and each internal node just stores routing information, the overwhelming majority of insertions and deletions won’t actually make changes to the internal nodes of the tree and will just touch the leaves (figure that only roughly one in every 2k insertions needs to split a node, and only one in every (2k)² operations needs to split two nodes, etc.). That means that there aren’t that many array edits in the first place, so speeding them up wouldn’t necessarily be worthwhile.
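As a rough back-of-the-envelope check, here's a small sketch that estimates the amortized number of splits per insertion for a few values of k, under the simplifying assumption that each node splits roughly once per 2k insertions routed into it:

```cpp
#include <cstdio>

int main() {
    const int branching[] = {16, 64, 256, 1024};   // values of k to try
    for (int k : branching) {
        const double per_level = 1.0 / (2.0 * k);  // assumed split rate per level
        double splits_per_insert = 0.0;
        double p = per_level;
        for (int level = 0; level < 10; ++level) { // sum the geometric series
            splits_per_insert += p;
            p *= per_level;
        }
        std::printf("k = %4d: ~%.6f splits per insertion (about 1 in %.0f)\n",
                    k, splits_per_insert, 1.0 / splits_per_insert);
    }
    return 0;
}
```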
Another major issue you’d run into is how you’d allocate the memory for those tree nodes. B-trees and the like are optimized for storage on disk. This means that a typical B-tree operation would work by paging in a block of raw bytes from disk into some buffer, then treating those bytes as your node. That in turn means that if you were to store a BST this way, the nodes in that BST couldn’t store conventional pointers because the memory addresses of the nodes, once loaded into memory, wouldn’t be consistent from run to run. That precludes using things like malloc or new, and you’d need to implement your own memory manager to divvy up the bytes of the disk page to your nodes. That’s going to introduce a lot of overhead and it’ll be tricky to get this to be time- and space-efficient enough to warrant switching to BSTs over a raw array.
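To make the "no conventional pointers" point concrete, here's a sketch of what a BST stored inside a page might look like, with entries linked by 16-bit slot indices into the page rather than by heap pointers (the layout, field names, and the unbalanced insert are all just for illustration):

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;
constexpr uint16_t kNil = 0xFFFF;

// One BST entry inside the page: a key plus two child *indices*, not pointers.
struct Entry {
    int64_t  key;
    uint16_t left;   // slot index of the left child within this page, or kNil
    uint16_t right;  // slot index of the right child within this page, or kNil
};

constexpr std::size_t kMaxEntries = kPageSize / sizeof(Entry);

// In a real system the page would be a raw byte buffer read from disk and the
// entries would be overlaid on it; a plain array keeps the sketch simple.
struct Page {
    Entry    entries[kMaxEntries];
    uint16_t root  = kNil;
    uint16_t count = 0;

    bool contains(int64_t key) const {
        uint16_t cur = root;
        while (cur != kNil) {
            const Entry& e = entries[cur];
            if (key == e.key) return true;
            cur = (key < e.key) ? e.left : e.right;
        }
        return false;
    }

    // Unbalanced insert with no bounds checks, just to show the index-based
    // linking; a real version would keep the tree balanced and track free
    // slots left behind by deletions.
    void insert(int64_t key) {
        uint16_t slot = count++;
        entries[slot] = Entry{key, kNil, kNil};
        if (root == kNil) { root = slot; return; }
        uint16_t cur = root;
        while (true) {
            Entry& e = entries[cur];
            uint16_t& next = (key < e.key) ? e.left : e.right;
            if (next == kNil) { next = slot; return; }
            cur = next;
        }
    }
};

int main() {
    Page page{};
    page.insert(41);
    page.insert(7);
    page.insert(99);
    return page.contains(7) ? 0 : 1;
}
```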
There’s also the memory overhead of using BSTs compared to raw arrays. BSTs require two extra pointers / offsets per item compared with raw arrays, which decreases the number of keys you can cram into a node. Since the main speed advantage of a B-tree derives from fitting as many keys as possible into a node, this could eat into the very thing that makes B-trees attractive in the first place.
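For a rough sense of that overhead, here's a small calculation comparing how many 8-byte keys fit in a 4 KiB page when stored as a plain array versus as BST entries that each carry two 2-byte child offsets (it ignores page headers and child page pointers, purely to show the relative cost):

```cpp
#include <cstdio>

int main() {
    const int page_bytes   = 4096;
    const int key_bytes    = 8;
    const int offset_bytes = 2;                           // per child link
    const int array_entry  = key_bytes;                   // just the key
    const int bst_entry    = key_bytes + 2 * offset_bytes; // key + two links

    std::printf("array layout: %d keys per page\n", page_bytes / array_entry);
    std::printf("BST layout:   %d keys per page\n", page_bytes / bst_entry);
}
```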
Hope this helps!
Upvotes: 0