Reputation: 1889
I'm considering adding the PH-tree to ELKI. I couldn't find any tutorials for examples for that, and the internal architecture is not fully obvious to me at the moment.
Some context: The PH-tree is a spatial index that was published at SIGMOD'14: paper, Java source code is available here. It is a bit similar to a quadtree, but much more space efficient, doesn't require rebalancing and scales quite well with dimensionality. What makes the PH-tree different from the R*-Tree implementations is that there is no concept of leaf/inner nodes, and nodes don't will not directly map to pages. It also works quite well with random insert/delete (no bulk-loading required).
Upvotes: 3
Views: 229
Reputation: 8715
Of course it would be nice to have a PH-tree in ELKI, to allow others to experiment with it. We want ELKI to become a comprehensive tool; it has R-trees, M-trees, k-d-trees, cover-trees, LSH, iDistance, inverted lists, space-filling-curves, PINN, ...; there are working-but-not-cleaned-up implementations of X-tree, rank-cover-trees, bond, and some more.
We want to enable researchers to study which index works best for their data easily, and of course it would be nice to have PH-tree, too. We also try to push the limits of these indexes, e.g. when supporting other distance measures than Euclidean distance.
The effort depends on how experienced you are with coding; ELKI uses some well-optimized data structures, but that means we are not using standard Java APIs in a number of places because of performance. Adding the cover tree took me about one day of work, for example (and it performed really nicely). I'd assume a more flexible (but also more memory intensive) k-d-tree would be a similar amount of work. I have not studied the PH-tree in detail, but I'd assume it is slightly more effort than that. My guts also say that it won't be as fast as advertised. It appears to be a prefix-compressed quadtree. In my experiments bit-interleaving approaches such as required for Hilbert curves can be surprisingly expensive. It also probably only works for Minkowski metrics. But you are welcome to prove me wrong. ;-)
You are always welcome to as for help at the mailing list, or here.
I would do an in-memory variant first, to fully understand the index. Then benchmark it to identify optimization potential, and debug it. Until then, you may not have figured out all the corner cases, such as duplicate points handling, degenerated data sets etc.
Always make on-disk optional. If your data fits into memory, a memory-only implementation will be substantially faster than any on-disk version.
When contributing to ELKI, please:
If you are looking for data mining project ideas, here is a list of articles/algorithms that we would love to see contributed to ELKI (we keep this list up to date for student implementation projects):
http://elki.dbs.ifi.lmu.de/wiki/ProjectIdeas
Upvotes: 1