Reputation: 2106
My data set is made of many rectangles that lie on an x,y plane (represented by a set of quad points). 99.9% of the time these rectangles will not overlap, but very rarely they will. I'm trying to find an optimal data structure for storing the rectangles so that I can find instances of intersection.
By the way, the rectangles contain text, so I'm doing this in order to find occurrences of the same text overlapping. This is because occurrences like this should be treated as one rectangle of text instead of two.
For example: let's say I'm searching for the text "123". There are two rectangles. The first rectangle contains "TEST 123" and the second contains "123". If "123" overlaps with the "123" within the first rectangle (within a given threshold) then my search result should return only one occurrence of the text "123".
So far I have briefly taken a look at quadtrees, r-trees, k-d trees, and range trees. I do not know much about these trees, nor do I know if any would work for this problem. I feel like an r-tree would not be optimal in this case because the the chance of overlap is very small.
Upvotes: 2
Views: 1126
Reputation: 1889
I understand you don't want the index to do any text recognition, it should really only detect overlapping (axis-aligned) rectangles. This is sometimes called a 'spatial join' operation.
As far as I am aware there are very few dedicated algorithms, except maybe the TOUCH algorithm (an optimised R-Tree, I think). So I would use a brute force approach, performing for each rectangle one window query on your dataset.
There are many possible algorithms based on spatial indexes. It depends on what your requirements are (except that kd-trees usually only work for points, not for rectangles).
For on-disk, one usually recommends R-Tree variants, such as R*tree or X-tree. However, R-Trees tend to perform less well with updates, but are usually used with an initial bulk load. Windows queries in R-Tree tend to work better with larger result sets, but that may depend on the actual dataset.
Quadtrees should be okay for your 'sparse' dataset, they are also simple to implement, but require a lot of memory and are not ideal for on-disk usage.
If you are using Java, have a look at my PH-Tree, it works a bit like a quadtree, but is much more space efficient and performs very well with large datasets, supports updates, and has very fast window queries, especially if the result sets are small (0 or 1 result). It may be exactly what you need, except that it is somewhat complex to implement (my version is Java and Apache v2 licensed), and there is currently no efficient way to store it on disk.
Upvotes: 3