PhD
PhD

Reputation: 11334

How to return a node, uniformly at random, from a binary search tree?

Given a BST (may or may not be balanced) how can one return "any" node uniformly at random? A constraint is you cannot use an external indexing data structure. You must traverse the tree in such a manner that every node has an equal chance of being visited.

This question has me perplexed for quite a while. If we can indeed use an external hashtable/pointers we could just randomize on those and return the corresponding node. However, my colleague has put forth a rather complex variant of the question, where no additional data structures can be used.

Update: You cannot have an inorder walk and store the result in an array either.

How can one achieve such a traversal?

Upvotes: 2

Views: 2626

Answers (3)

rici
rici

Reputation: 241881

Walk the tree in any order, keeping the following values:

  • N: the number of nodes seen

  • selected: the currently selected node.

Initially, N is 0 and selected is None. Visiting a node consists of the following:

  1. Increment N

  2. Generate a random integer in the range [0, N).

  3. If the random integer selected is 0, set selected to the current Node.

Note that the values N and selected need to be modified during the walk. That means that they are both input and output values to the visitor function.

At the end of the walk, N will be the number of nodes in the tree, and selected will be a random node selected with uniform probability (assuming you have a good random number generator).

This algorithm is not restricted to BSTs. It will work on any tree of any shape. In particular, it will work on a simple linear sequence of objects of unknown length, corresponding to the well-known random selection algorithm which is to iterate over the objects, replacing the selected random object with the newly visited one with probability 1/N where N is the number of objects seen to date.

If you keep track of visited nodes, it will also work on any connected graph.

If you have a very large tree (or graph), perhaps spread over a number of servers and/or storage devices, you can use a different presentation of this algorithm, which provides for a certain level of parallelism (and also prevents the need to keep a global walk structure or pass values into the walk).

We assume that each node-server has direct access to k objects and indirect access to some known number of children servers. The algorithm allows for redundant children, but assumes that network communication is (almost) perfect; dealing with network splits is outside of the scope of this answer. We also assume that every query has an associated unique query number, which allows us to deal with some networking artifacts. The query has no other information (other than the server to respond to), and is expected to return a tuple consisting of a count and a randomly-selected node.

When a node-server receives a query with id q, it does the following:

  1. If it has previously responded to query q, immediately return <0, null>

  2. Set count to k and selected to a randomly-selected object from the k objects it has direct access to.

  3. For each child server, send the query (with the same query-id)

  4. For each response returned (it doesn't matter what order the responses come in):

    a. Add response.count to count

    b. With probability response.count / count, replace selected with response.selected

  5. When all children servers have responded, return <count, selected>

Upvotes: 5

user2566092
user2566092

Reputation: 4661

If you have a full (bottom level is completely full) binary search tree then it is possible to quickly (sub-linear time) do as you ask, because you know the structure of the tree. If you want me to post an answer that gives the solution for this case, let me know and I'll update my answer. However, if you just have a generic binary search tree of arbitrary shape, then it is impossible to uniformly sample a node without visiting the entire tree so that you know the shape.

Upvotes: 0

tmyklebu
tmyklebu

Reputation: 14215

If you know the number of nodes n in the tree, find the kth node in an inorder walk of the tree for some randomly-chosen k between 0 and n-1, inclusive. If you don't, you can walk the tree and figure out its size and then do the above.

If the nodes of the tree are stored contiguously in some array, randomly sample an array element and return it.

If each tree node can tell you the size of the subtree rooted there, figure out how many nodes are in the tree, generate a random k up to the number of nodes in the tree, and select the kth element of the tree.

In all cases above, the "tree" part is a red herring.

There is a random walk on any nontrivial connected graph whose stationary distribution is uniform. Select a neighbour of the current node. If it has lower or equal degree, go there. If it has higher degree, go there with probability cur_deg / that_deg and stay put otherwise. Sampling from this random walk is called in various contexts "Gibbs sampling" and "Metropolis-Hastings."

Upvotes: 1

Related Questions