templatetypedef

Reputation: 372734

How do you keep a binary search tree balanced?

The runtime of most operations on binary search trees depends on the height of the tree. If the tree is nicely balanced, the cost of an insertion, deletion, lookup, successor, predecessor, minimum, or maximum query is O(log n). However, if the tree isn't balanced, the costs of these operations can go up as high as O(n).

How can you keep a binary search tree balanced as elements are inserted and deleted?

Upvotes: 4

Views: 3701

Answers (2)

AYUSH PATEL

Reputation: 1

Code to convert a normal BST into a balanced BST. Call the given function:


// Note: assumes a Node type with an int data field, left/right child
// pointers, and a Node(int) constructor, as is typical on coding-practice
// sites where this snippet would be dropped in.
#include <stack>
#include <vector>
using namespace std;

// Rebuilds a perfectly balanced BST from the sorted keys in
// inorder[s..e] by repeatedly picking the middle element as the root.
Node* solve(int s, int e, vector<int>& inorder) {
    if (s > e) return NULL;

    int mid = s + (e - s) / 2;   // written this way to avoid overflow
    Node* node = new Node(inorder[mid]);
    node->left = solve(s, mid - 1, inorder);
    node->right = solve(mid + 1, e, inorder);

    return node;
}

class Solution {
public:
    Node* buildBalancedTree(Node* root) {
        // Collect all keys in sorted order via an iterative inorder
        // traversal.
        vector<int> inorder;
        stack<Node*> st;
        Node* node = root;

        while (true) {
            if (node != NULL) {
                st.push(node);
                node = node->left;
            } else {
                if (st.empty()) break;

                node = st.top();
                st.pop();
                inorder.push_back(node->data);
                node = node->right;
            }
        }

        // Rebuild a perfectly balanced tree from the sorted keys.
        int n = inorder.size();
        return solve(0, n - 1, inorder);
    }
};


Upvotes: 0

templatetypedef

Reputation: 372734

There are many, many ways to keep binary search trees balanced, each of which introduces a different set of tradeoffs. Generally speaking, balanced binary search trees fall into one of these categories:

  • Height-Balanced Trees: Trees that attempt to keep height differences between different parts of the tree somewhat equal.

  • Rank-Balanced Trees: A recently-developed generalization of height-balanced trees where each node is given a rank, and nodes try to keep rank differences between themselves and their parents low.

  • Weight-Balanced Trees: Trees that attempt to keep the number of nodes in different regions of the tree somewhat equal.

  • Randomized Trees: Trees that randomize their shape and attempt to thereby keep the overall height low.

  • Static Trees: Trees designed to take on a specific shape that is good for a particular set of queries.

  • Self-Adjusting Trees: Trees that reshape themselves in response to accesses to keep lookup costs low.

Here's a quick rundown of these different strategies, along with some examples of different trees of each type.

Height-Balanced Trees

Height-balanced trees, intuitively, work by imposing structural constraints that ensure that certain subtrees have heights that can't differ by "too much," for some definition of "too much." They keep the overall tree height low by ensuring that the tree can only grow to a certain height if a large number of nodes are present. Several of the most-commonly-used trees fall into this category. For example:

AVL Trees

AVL trees, named after the initials of their inventors, are the original balanced binary search tree data structure, invented in 1962. AVL trees are binary trees that obey the following structural constraint: each node's two subtrees can have a height difference of at most one. This is a tight structural constraint: any AVL tree of height h has between F_{h+2} and 2^h nodes, where F_n is the nth Fibonacci number.

To maintain this requirement, AVL trees perform tree rotations whenever an insertion or deletion creates a subtree whose left and right subtrees have a height difference of ±2.

Because of the tight structural constraints, AVL trees tend to have very low heights relative to the number of nodes. However, this also means that the number of rotations performed on an insertion or deletion may be high as a single insertion or deletion can change the relative heights of many nodes' subtrees.
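To make the rotation machinery concrete, here's a minimal sketch (not from any particular library; the node layout and helper names are illustrative) of how an AVL node might cache subtree heights and how one rebalancing step could be implemented:

#include <algorithm>

// Hypothetical AVL node: each node caches the height of its subtree.
struct AVLNode {
    int key;
    int height = 1;
    AVLNode *left = nullptr, *right = nullptr;
    explicit AVLNode(int k) : key(k) {}
};

int height(AVLNode* n) { return n ? n->height : 0; }
void update(AVLNode* n) { n->height = 1 + std::max(height(n->left), height(n->right)); }
int balanceFactor(AVLNode* n) { return height(n->left) - height(n->right); }

// Single rotations; each returns the new root of the rotated subtree.
AVLNode* rotateRight(AVLNode* y) {
    AVLNode* x = y->left;
    y->left = x->right;
    x->right = y;
    update(y); update(x);
    return x;
}

AVLNode* rotateLeft(AVLNode* x) {
    AVLNode* y = x->right;
    x->right = y->left;
    y->left = x;
    update(x); update(y);
    return y;
}

// Restores the AVL invariant at n after an insertion or deletion below
// it, handling the four classic cases (LL, LR, RR, RL).
AVLNode* rebalance(AVLNode* n) {
    update(n);
    if (balanceFactor(n) > 1) {   // left subtree is two taller
        if (balanceFactor(n->left) < 0) n->left = rotateLeft(n->left);  // LR
        return rotateRight(n);                                          // LL
    }
    if (balanceFactor(n) < -1) {  // right subtree is two taller
        if (balanceFactor(n->right) > 0) n->right = rotateRight(n->right); // RL
        return rotateLeft(n);                                              // RR
    }
    return n;
}

An insertion would then call rebalance on every node along the path from the insertion point back up to the root; deletions rebalance the same way.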

There are several modern variants of AVL trees. The RAVL tree (relaxed AVL tree) generalizes AVL trees by allowing for greater degrees of imbalance after deletions, reducing the amount of work required during each insertion or deletion operation.

Red/Black Trees

Red/black trees are binary search trees in which each node is assigned a color (red or black) according to a strict set of rules:

  • The root node is black.
  • No red node has a red child.
  • Any path starting at the root of the tree and walking off the tree passes through the same number of black nodes.

That last rule is the most subtle. It means that if you begin at the root node and walk left or right however you'd like, at the point that you step off the tree the number of black nodes visited will always be the same regardless of which left/right choices you made.

These rules ensure that the deepest leaf node is at most roughly twice as deep as the shallowest leaf node. Intuitively, that's because the extreme case would be to have one leaf node reachable by a path consisting purely of black nodes and another leaf reachable by a path that alternates black/red/black/red/..., since red nodes can't have red children. A more detailed analysis establishes the stronger guarantee that the height of the tree is O(log n).

Insertions and deletions in a red/black tree are accomplished by doing normal insertions or deletions, followed by a series of rotations and color changes to ensure that the above rules are met. Unlike AVL trees, red/black trees generally make few rotations and do little "fixup" work after an insertion or deletion. Specifically, the amortized amount of fixup work required per insertion or deletion is O(1), so most insertions and deletions will do the regular O(log n) tree operation plus a very small amount of added work. As a result, while red/black trees tend to be taller than AVL trees, they're a bit faster in workflows that have large numbers of insertions and deletions.
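As a sketch of what these rules mean in code, here's a hypothetical validator (illustrative only, not any library's API) that checks all three rules by computing the black height of each subtree:

// Hypothetical red/black node, for illustration.
struct RBNode {
    int key;
    bool isRed;
    RBNode *left = nullptr, *right = nullptr;
};

// Returns the number of black nodes on every path from this node off the
// tree, or -1 if some rule is violated in this subtree.
int blackHeight(RBNode* node) {
    if (node == nullptr) return 0;   // we've walked off the tree
    if (node->isRed &&
        ((node->left && node->left->isRed) ||
         (node->right && node->right->isRed)))
        return -1;                   // a red node has a red child
    int lh = blackHeight(node->left);
    int rh = blackHeight(node->right);
    if (lh == -1 || rh == -1 || lh != rh)
        return -1;                   // paths disagree on black-node counts
    return lh + (node->isRed ? 0 : 1);
}

bool isValidRedBlackTree(RBNode* root) {
    if (root != nullptr && root->isRed) return false;  // root must be black
    return blackHeight(root) != -1;
}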

AA Trees

AA trees are a style of height-balanced trees closely related to red/black trees.

Both red/black trees and AA trees are related to a family of height-balanced multiway search trees called B-trees. Intuitively, B-trees are multiway trees in which each node can store (roughly) b to 2b keys for some external parameter b. They work by performing insertions into leaf nodes, then splitting larger leaves and "kicking" keys higher into the tree when the size limit is exceeded.

The red/black tree can be thought of as - and indeed was invented by - modeling a B-tree in which each node holds 1, 2, or 3 keys (a 2-3-4 tree). The idea is that each black node in a red/black tree corresponds to a node in the 2-3-4 tree, and each red node in a red/black tree represents a key that is "pulled up" into the black node above it. AA trees, on the other hand, are modeled after B-trees in which each node has 1 or 2 keys (a 2-3 tree), using a similar set of techniques. AA trees also enforce a rule that "red" nodes must hang to the right of the black node that they're pulled up into. This reduces the number of cases to check during an insertion or deletion, but also increases the number of rotations that may need to be performed.

Left-Leaning Red/Black Trees

A "hybrid" between the classical red/black tree and the AA tree is the left-leaning red/black tree. This tree structure, like the red/black tree, encodes a 2-3-4 tree as a binary search tree. As the name suggests, though, in the case where a black node has exactly one red child, that red child must hang to the left of its black parent.

This reduces the number of cases that can arise in an insertion or deletion, but, like AA trees, increases the number of rotations that must be performed during tree edits.


Rank-Balanced Trees

Rank-balanced trees assign a number called a rank to each node, then enforce a set of rules to ensure that rank differences between nodes and their children are not "too large." The original paper on rank-balanced trees shows that this family of trees includes AVL trees and (with minor edits) also includes red/black trees.

WAVL Trees

The WAVL tree (weak AVL tree) is the most famous of the rank-balanced trees. It works by assigning nodes ranks so that each node's rank is at most two larger than its children's ranks. WAVL trees are identical to AVL trees when elements are never deleted from the tree. Additionally, any WAVL tree can be colored according to the red/black rules, so WAVL trees are never worse in height or shape than red/black trees. Like red/black trees but unlike AVL trees, WAVL trees require at most a constant number of rotations on each insertion or deletion.


Weight-Balanced Trees

Weight-balanced trees aim to keep the overall height of the tree low by ensuring some "nice" relationship between the number of nodes in each node's left and right subtrees. The basic idea is that if each node splits the remaining nodes into some nice fraction (say, 75% / 25%), then each step down the tree causes the size of the current subtree to decay geometrically, ensuring that the tree has logarithmic height.

BB[α] trees

BB[α] trees (trees of bounded balance, parameter α) are binary search trees in which each node's subtrees have a "weight" that is always at least an α fraction of their parent's "weight." (In BB[α] trees, the weight of a node is given by the total number of nodes in its subtree, plus one.) As α gets closer and closer to 1/2, the relative sizes of the left and right subtrees must get closer and closer together. This means that more work must be done to maintain the tree shape, but the overall tree height gets lower. As α gets smaller, the relative sizes of the left and right subtrees are less constrained, meaning less work is done to insert or remove elements but the tree height gets larger and larger.
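The invariant itself is simple to state in code. Here's an illustrative sketch, assuming each node caches its subtree size:

// Hypothetical BB[alpha] node; size counts the nodes in its subtree and
// is kept up to date as the tree changes.
struct WBNode {
    int key, size;
    WBNode *left, *right;
};

// In BB[alpha] trees, a node's weight is its subtree size plus one.
int weight(WBNode* n) { return (n ? n->size : 0) + 1; }

// The balance rule: each child's weight must be at least an alpha
// fraction of its parent's weight. A violation after an insertion or
// deletion triggers rebalancing rotations.
bool isBalanced(WBNode* n, double alpha) {
    if (n == nullptr) return true;
    return weight(n->left) >= alpha * weight(n) &&
           weight(n->right) >= alpha * weight(n);
}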

Like all the trees mentioned above, BB[α] trees use tree rotations to reshuffle nodes after an insertion or deletion to maintain their balance condition. The original version of BB[α] trees had an upper bound on the choice of α of around 0.25, meaning that each step in the tree would guarantee that at least 25% of the remaining nodes would no longer be in the currently-searched subtree.

Scapegoat Trees

Scapegoat trees are a hybrid between a weight-balanced and height-balanced tree. The tree itself is a weight-balanced tree in that there's a parameter α (no relation to the α parameter from BB[α] trees) such that the size of each node's two subtrees is at most α times the size of the node itself. Here, the "size" of a node is the number of nodes in its subtree.

Unlike the aforementioned balanced tree types, scapegoat trees do not (directly) use rotations to perform their rebalancing. Instead, whenever an insertion is performed that makes the tree "too tall" to be weight-balanced, it searches backwards along the insertion path to find a node that isn't properly weight-balanced, then rebuilds that entire subtree to be perfectly-balanced. In that sense, while the shape of the tree is that of a weight-balanced tree, the strategy for rebalancing works by looking for height-balance violations.
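Here's a rough sketch of that strategy, with illustrative names. The rebuild step here uses O(n) auxiliary space for simplicity; the Day-Stout-Warren algorithm mentioned below avoids that.

#include <vector>

// Hypothetical scapegoat-tree node, for illustration.
struct SGNode {
    int key;
    SGNode *left = nullptr, *right = nullptr;
};

int subtreeSize(SGNode* n) {
    return n ? 1 + subtreeSize(n->left) + subtreeSize(n->right) : 0;
}

// A node is alpha-weight-balanced if neither child's subtree holds more
// than an alpha fraction of its nodes. The scapegoat is the first node
// on the walk back up the insertion path that fails this test.
bool isWeightBalanced(SGNode* n, double alpha) {
    int size = subtreeSize(n);
    return subtreeSize(n->left) <= alpha * size &&
           subtreeSize(n->right) <= alpha * size;
}

// Flatten the scapegoat's subtree into sorted order...
void flatten(SGNode* n, std::vector<SGNode*>& out) {
    if (!n) return;
    flatten(n->left, out);
    out.push_back(n);
    flatten(n->right, out);
}

// ...then rebuild it perfectly balanced by repeatedly taking the middle
// node as the subtree root.
SGNode* buildBalanced(std::vector<SGNode*>& nodes, int lo, int hi) {
    if (lo > hi) return nullptr;
    int mid = lo + (hi - lo) / 2;
    SGNode* root = nodes[mid];
    root->left = buildBalanced(nodes, lo, mid - 1);
    root->right = buildBalanced(nodes, mid + 1, hi);
    return root;
}

SGNode* rebuildScapegoat(SGNode* scapegoat) {
    std::vector<SGNode*> nodes;
    flatten(scapegoat, nodes);
    return buildBalanced(nodes, 0, (int)nodes.size() - 1);
}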

This approach does not guarantee worst-case O(log n) performance on an insertion or deletion due to the cost of optimally rebalancing the violating subtree. However, it does give an amortized O(log n) cost per insertion or deletion, since it's rare to need to do a large rebuild and, after a large rebuild is done, the tree ends up perfectly balanced.

The actual logic to rebuild the bad subtree can be done in linear time using only O(1) auxiliary storage space via the Day-Stout-Warren algorithm, which optimally rebuilds a BST to be perfectly-balanced using a clever set of tree rotations.

Scapegoat trees are often used as building blocks in larger data structures in which rebalancing via rotation is not an option. For example, scapegoat trees can be combined with k-d trees to form dynamic k-d trees, since normal BST rotations in a k-d tree aren't permitted.


Randomized Trees

Randomized trees work by choosing a random tree shape subject to certain rules. Because most randomly-chosen binary search tree shapes have low height (it's very unlikely that you'll get a long chain of nodes), these trees have a high probability of being balanced.

Treaps

Treaps are, as the name suggests, a hybrid between a binary search tree and a binary heap (or, more accurately, between a binary search tree and a Cartesian tree). Each node in a treap is annotated with a uniformly-random weight (say, a random 32-bit integer, or a random real number between 0 and 1), and the nodes are arranged such that

  • the nodes form a binary search tree with respect to the keys in the treap, and
  • every node's weight is less than or equal to its children's weights.

These two properties uniquely determine the shape of the treap; in fact, for any set of (distinct) keys and weights there is exactly one treap holding those keys and weights.

A useful perspective for understanding treaps is to imagine running a randomized quicksort on the keys stored in the tree. At the first round of quicksort, we pick a random pivot (imagine picking the key with the lowest weight), then reorder the elements so that smaller elements go to the left of the pivot (into the left subtree) and larger elements go to the right of the pivot (into the right subtree). We then recursively sort those elements (recursively building the rest of the tree). As a result, by the same analysis that shows that the total cost of randomized quicksort is expected O(n log n), the expected depth of any node in the treap is O(log n).

Insertions and deletions into a treap can be performed using very simple tree rotations. An insertion is done by inserting as usual, then rotating the node with its parent until its weight exceeds its parent's weight. Deletions can be done by rotating the node with its lower-weight child until the node becomes a leaf, then deleting the node.
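Here's a sketch of treap insertion in its recursive rotate-up formulation (node layout and names are illustrative):

// Hypothetical treap node: keys ordered as a BST, weights as a min-heap.
struct TreapNode {
    int key;
    unsigned weight;   // chosen uniformly at random at insertion time
    TreapNode *left = nullptr, *right = nullptr;
    TreapNode(int k, unsigned w) : key(k), weight(w) {}
};

TreapNode* rotateRight(TreapNode* y) {
    TreapNode* x = y->left;
    y->left = x->right;
    x->right = y;
    return x;
}

TreapNode* rotateLeft(TreapNode* x) {
    TreapNode* y = x->right;
    x->right = y->left;
    y->left = x;
    return y;
}

// Insert as in a plain BST, then rotate the new node upward as long as
// its weight is smaller than its parent's, restoring the heap property.
TreapNode* insert(TreapNode* root, int key, unsigned weight) {
    if (root == nullptr) return new TreapNode(key, weight);
    if (key < root->key) {
        root->left = insert(root->left, key, weight);
        if (root->left->weight < root->weight) root = rotateRight(root);
    } else {
        root->right = insert(root->right, key, weight);
        if (root->right->weight < root->weight) root = rotateLeft(root);
    }
    return root;
}

A caller would draw weight uniformly at random (say, from a std::mt19937) each time a key is inserted.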

Zip Trees

Zip trees are an alternative to treaps that require fewer random bits per node. Like treaps, each node is assigned a random weight, though this time from a geometric distribution rather than a uniform distribution. The rule is that each node's weight must be greater than its left child's weight and at least as large as its right child's weight, so rank ties are only permitted with right children. These rules, as in treaps, are preserved by performing rotations whenever a node is inserted or deleted, or by performing an equivalent operation called zipping or unzipping that simulates the rotations without actually performing them.

Zip trees were invented as a way of encoding a skiplist as a randomized binary search tree. They tend to be slightly taller in expectation than treaps, but due to the use of geometric rather than uniform random variables they require fewer random bits per node (treaps need roughly O(log n) bits per node; zip trees need roughly O(log log n) bits per node).
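Generating a geometric rank is cheap: as a sketch, flip a fair coin and count the tails before the first heads. Since the result is O(log n) with high probability, only O(log log n) bits are needed to store it:

#include <random>

// Sketch: draw a rank from a geometric distribution by counting fair
// coin flips until the first heads. The expected rank is 1, and ranks
// are O(log n) with high probability, which is why each one fits in
// O(log log n) bits.
int randomRank(std::mt19937& rng) {
    int rank = 0;
    while (rng() & 1u) ++rank;
    return rank;
}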

Zip-Zip Trees

Zip-zip trees can be thought of as a hybrid between zip trees and treaps. Like a zip tree, each node in a zip-zip tree has an associated weight drawn from a geometric distribution, and each node's weight is bigger than its left child's weight and at least as large as its right child's weight. However, each node is also assigned a second weight value that is uniformly random over a small range, and when comparing weights, if two nodes have the same primary weight, their secondary weights are used as a tiebreaker. This has the effect of taking long runs of nodes with the same weight and reshaping the chain into a nice tree shape. As a result, zip-zip trees use only O(log log n) random bits per node, but have the same height guarantees as a treap - almost the best of both worlds!


Static Trees

Static binary search trees are binary search trees that don't allow for insertions or deletions at all. They're typically used in cases where the access probabilities of each node are known or can be estimated in advance.

Statically Optimal BSTs

Statically optimal BSTs are binary search trees specifically constructed to minimize the expected cost of a lookup in the tree, assuming the access probabilities of each node are known in advance. For example, if you were building a BST to store contact information inside a phone and knew which people were most likely to be looked up, you might structure the BST to place the commonly-called people in the tree higher up and the less-frequently-called people down lower in the tree.

Don Knuth found an O(n^2)-time algorithm for building an optimal binary search tree given the access probabilities of each node. The algorithm is a clever dynamic programming solution that works on the following insights. First, some node - we're not immediately sure which - must go at the root. And given any choice of the root node, we'd then build optimal binary search trees for the left and right subtrees of the root, which correspond to the elements less than and greater than the root, respectively. This means that there are only O(n^2) possible subproblems to consider, corresponding to each consecutive subrange of the elements to store in the tree. Naively, it'll take time O(n) to determine the solution to any of these subproblems because in each subrange there are O(n) nodes to try as the root. However, Knuth showed that there's some clever structure to how these pivot choices work that lets the overall evaluation complexity work out to O(n^2).
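Here's a sketch of that dynamic program (illustrative, and ignoring unsuccessful searches). It includes Knuth's observation that the optimal root for the range (i, j) always lies between the optimal roots for (i, j-1) and (i+1, j), which is what brings the total work down to O(n^2):

#include <limits>
#include <vector>

// Sketch: expected lookup cost of an optimal BST over the sorted keys
// 0..n-1, where p[i] is the probability of looking up key i.
double optimalBSTCost(const std::vector<double>& p) {
    int n = (int)p.size();
    if (n == 0) return 0.0;

    // prefix[i] = p[0] + ... + p[i-1], so the probability mass of keys
    // i..j is prefix[j + 1] - prefix[i].
    std::vector<double> prefix(n + 1, 0.0);
    for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + p[i];

    // dp[i][j] = cost of an optimal BST over keys i..j;
    // root[i][j] = index of the root of that optimal tree.
    std::vector<std::vector<double>> dp(n + 1, std::vector<double>(n, 0.0));
    std::vector<std::vector<int>> root(n + 1, std::vector<int>(n, 0));
    for (int i = 0; i < n; i++) { dp[i][i] = p[i]; root[i][i] = i; }

    for (int len = 2; len <= n; len++) {
        for (int i = 0; i + len - 1 < n; i++) {
            int j = i + len - 1;
            dp[i][j] = std::numeric_limits<double>::infinity();
            // Knuth's optimization: the best root lies between the best
            // roots of the two slightly smaller ranges.
            for (int r = root[i][j - 1]; r <= root[i + 1][j]; r++) {
                double left = (r > i) ? dp[i][r - 1] : 0.0;
                double right = (r < j) ? dp[r + 1][j] : 0.0;
                double cost = left + right + (prefix[j + 1] - prefix[i]);
                if (cost < dp[i][j]) { dp[i][j] = cost; root[i][j] = r; }
            }
        }
    }
    return dp[0][n - 1];
}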

It was later proved that the cost of a lookup in such a tree is O(1 + H), where H is the Shannon entropy of the probability distribution of the keys. This quantity H ranges from zero (all accesses are to a single key) through log n (all keys have an equal chance of being looked up) depending on how skewed the distribution is.

Weight-Equalized Trees

Weight-equalized trees, sometimes confusingly called weight-balanced trees, are a type of static tree constructed according to a simple rule. The root node is chosen so that the sum of the access probabilities of the left and right subtrees are as close as possible, and those subtrees are recursively constructed in the same way.

The above rule says "equalize the weights of the left and right subtrees as much as possible," and so it's not particularly surprising that trees built this way are weight-balanced with respect to the total probability mass of each subtree. Specifically, you can prove that each subtree has at most 2/3 of the probability mass of its parent tree. With a little bit more math you can prove that the cost of lookups in these trees is O(1 + H), within a constant factor of the expected lookup cost of Knuth's optimal trees.

Naively, it would take O(n^2) time to build a weight-equalized tree: you could try each node as the potential tree root and recursively build weight-equalized trees for the left and right subtrees. However, it's possible to speed this construction up to O(n log n) for a set of unsorted keys by sorting the keys and using a clever binary search to find the optimal root. Later work showed that this can be improved even further to O(n) construction time from a set of sorted keys by using a very clever two-sided exponential search.
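As a sketch, here's the naive O(n^2) construction; the faster versions replace the linear scan for the root with a binary or exponential search. All names here are illustrative:

#include <cmath>
#include <limits>
#include <vector>

// Hypothetical static-tree node over sorted keys with known access
// probabilities.
struct StaticNode {
    int key;
    StaticNode *left = nullptr, *right = nullptr;
    explicit StaticNode(int k) : key(k) {}
};

// Builds a weight-equalized tree over keys[lo..hi]: pick the root that
// makes the probability mass of the left and right subtrees as close to
// equal as possible, then recurse. prefix[i] holds the sum of the access
// probabilities of keys 0..i-1.
StaticNode* buildWeightEqualized(const std::vector<int>& keys,
                                 const std::vector<double>& prefix,
                                 int lo, int hi) {
    if (lo > hi) return nullptr;

    int best = lo;
    double bestDiff = std::numeric_limits<double>::infinity();
    for (int r = lo; r <= hi; r++) {
        double leftMass = prefix[r] - prefix[lo];          // keys lo..r-1
        double rightMass = prefix[hi + 1] - prefix[r + 1]; // keys r+1..hi
        double diff = std::fabs(leftMass - rightMass);
        if (diff < bestDiff) { bestDiff = diff; best = r; }
    }

    StaticNode* root = new StaticNode(keys[best]);
    root->left = buildWeightEqualized(keys, prefix, lo, best - 1);
    root->right = buildWeightEqualized(keys, prefix, best + 1, hi);
    return root;
}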


Self-Adjusting Trees

Self-adjusting trees attempt to achieve good runtimes in a different way - by dynamically restructuring themselves in response to queries. By adapting to the queries made of them, they often can, either practically or theoretically, outperform standard balanced trees in cases where there is some nice structure to the queries being made.

Splay Trees

Splay trees are the most famous of the self-adjusting search trees. A splay tree is a regular binary search tree with one twist - whenever a node is inserted, deleted, or looked up, that node is moved up to the root via a process called splaying. A splay operation is performed by repeatedly looking at a node, its parent, and its grandparent, then deciding on a series of rotations that move the node closer to the root. The cases are called zig, zig-zag, and zig-zig and are fairly straightforward to implement.
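Here's a minimal sketch of the splay operation using parent pointers (illustrative; real implementations fold this into lookup, insertion, and deletion):

// Hypothetical splay-tree node with a parent pointer, for illustration.
struct SplayNode {
    int key;
    SplayNode *left = nullptr, *right = nullptr, *parent = nullptr;
};

// Rotates n above its parent while preserving BST order.
void rotateUp(SplayNode* n) {
    SplayNode* p = n->parent;
    SplayNode* g = p->parent;
    if (n == p->left) {               // right rotation
        p->left = n->right;
        if (n->right) n->right->parent = p;
        n->right = p;
    } else {                          // left rotation
        p->right = n->left;
        if (n->left) n->left->parent = p;
        n->left = p;
    }
    p->parent = n;
    n->parent = g;
    if (g) (g->left == p ? g->left : g->right) = n;
}

// Moves n all the way to the root using the zig, zig-zig, and zig-zag
// cases.
void splay(SplayNode* n) {
    while (n->parent != nullptr) {
        SplayNode* p = n->parent;
        SplayNode* g = p->parent;
        if (g == nullptr) {
            rotateUp(n);                               // zig
        } else if ((g->left == p) == (p->left == n)) {
            rotateUp(p); rotateUp(n);                  // zig-zig
        } else {
            rotateUp(n); rotateUp(n);                  // zig-zag
        }
    }
}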

Beyond this rule, splay trees enforce no constraints whatsoever on their shape. This means that splay trees can become highly imbalanced in a conventional sense. However, the splay operation has some astonishing properties that make the splay tree incredibly fast in an amortized sense. Specifically:

  • The amortized cost of looking up an element is O(log n).
  • The amortized cost of looking up an element is O(1 + H), where H is the Shannon entropy of the distribution of accesses across nodes. Stated differently, splay trees can be used as a replacement for static trees.
  • The amortized cost of looking up an element is O(log t), where t is the number of different items that have been accessed since the last time the item queried has been looked up. In other words, if at each point in time there's a set of "hot" items in the tree, the cost of lookups depends only on how many hot items there are, not how many items exist in the tree.
  • The amortized cost of looking up an element is O(log Δ), where Δ is the rank difference between the queried item and the last queried item. That is, if you imagine that the tree stores a sorted array of elements, the cost of a lookup depends only on how many steps away in the array you are from the last queried item, not how many total items there are.

It's suspected, but not proven, that splay trees are dynamically optimal, in the sense that no other self-adjusting BST can outperform a splay tree over any sufficiently long access sequence.

However, the overhead of performing rotations on every operation, the fact that splay trees don't play well with concurrency, and the purely amortized nature of their guarantees mean that splay trees are not commonly used as a "standard" BST implementation.

Tango Trees

Tango trees are binary search trees that consist of several different red/black trees hanging off one another in a way that changes per access. Tango trees aim for efficiency in a very different way than the other trees here: they're built to guarantee that the cost of any sequence of operations on a Tango tree takes at most O(log log n · c*), where c* is the best possible cost of performing that sequence of operations on any balanced BST structure.

More specifically, the tango tree works by envisioning a reference binary tree (not actually built anywhere) with the tree contents as leaves. Each node in the tree has a preferred child, which causes the tree to split the edges apart into paths called "preferred paths." The Tango tree stores each of these paths as a red/black tree, with non-preferred edges used to link each red/black tree to a child red/black tree. On a lookup, the preferred children in the reference tree are changed such that the key looked up is on a preferred path down from the root, and the red/black trees are restructured to match the resulting paths.

By using splay trees instead of red/black trees in a tango tree, we get the multisplay tree, which also performs its operations in time O(log log n · c*), but also guarantees amortized O(log n) time per lookup along with several other nice properties (for example, the cost of sequentially looking up each item in a multisplay tree is O(n)).


More to Explore

There are many other beautiful data structures out there that I didn't have time to include here in full detail. Here's a sampler of other ones worth looking up:

  • B-trees are used extensively in databases and file systems, and serve as inspirations for and building blocks in other data structures. The red/black tree and AA tree are both designed as encodings of specific B-trees as binary search trees.

  • Skiplists are an alternative to balanced BSTs that work by running multiple hierarchical linked lists through a collection of items. The original skiplist data structure was randomized and guaranteed O(log n) expected-time operations (this structure, adapted into a BST, gives the zip tree). Later work produced deterministic skiplists that work by modeling 2-3-4 trees, making them essentially identical to red/black trees except with a totally different representation.

  • Iacono's working set structure uses a collection of balanced BSTs to store items in a way that guarantees that lookups of more recently queried items run faster than lookups of older items. It's a building block in Iacono's unified structure, which makes the cost of looking up items that are near recently queried items (in a technical sense) much faster than normal.

  • Geometric Greedy, whose actual name is a bit too colorful for Stack Overflow, is a type of BST that's conjectured to be "as good as it gets" for binary search trees. It's a self-adjusting tree that looks at past access patterns to restructure the tree to minimize the number of nodes touched per lookup. Whether this is actually an optimal BST remains to be seen.

  • Finger search trees are BSTs restructured around a common access point called a finger, with queries to items near the finger running much faster than queries to items further away from the finger.

Hope this helps!

Upvotes: 13
