irvan98
irvan98

Reputation: 1

Kd-Tree Insertion Order

I am using a KD-Tree to optimize Range searching on a set of 2D points (x,y). To save time, I try to use Java Topology Suite's KD-Tree. However, the javadoc states that:

Note that the structure of a KD-Tree depends on the order of insertion of the points. A tree may become imbalanced if the inserted points are coherent (e.g. monotonic in one or both dimensions). A perfectly balanced tree has depth of only log2(N), but an imbalanced tree may be much deeper. This has a serious impact on query efficiency

So my question is: How to insert points to a KD-tree in such a way that minimizes the height of the tree?

Upvotes: 0

Views: 331

Answers (1)

TilmannZ
TilmannZ

Reputation: 1899

In wouldn't worry too much about it. It is a bit like hash maps which have in theory worst case O(n) lookup if all entries have to happen the same hash code. In reality this very unlikely to happen unless there is something wrong with the hash function.

To avoid imbalance, you should make sure that your points are not ordered in any way. If they are, consider shuffling them before inserting them.

Also, some kd-tree implementations have a rebalance() function that can be called after inserting (the bulk of) the data. This will rebalance the tree internally.

Finally, if you really want to use a specific insertion order that avoids imbalance, you can do the following. Note that this approach is not optimal, it will avoid the worst case but will not typically result in a fully balanced tree: Sort the points by x-coordinate, then insert them using a binary split search. For example, if you have 15 points, sort them by 'x' to get a sorted list of point p0...p14. Then:

  1. Take the middle point in the range (p7) and insert it.
  2. For each side of the split point, create a new range: p0-p6 and p8-p15
  3. Start over with 1) with each of the two ranges

This results in the following insertion order:

  1. round: p7
  2. round: p3 and p11
  3. round: p1, p5, p9, p13
  4. round: p0, p2, p4, p6, p8, p10, p12, p14

Why is this not ideal?

  • We completely ignore the y-coordinate. This approach only avoids imbalance with respect to the x coordinate.
  • Typical kd-trees split alternatingly between x and y. However, we do not know which comes first in the given implementation so we can only pretend that it always splits on x. This does no immediate harm, but it prevents more optimal insertion.

Upvotes: 1

Related Questions