TilmannZ
TilmannZ

Reputation: 1889

Adding PH-Tree to ELKI

I'm considering adding the PH-tree to ELKI. I couldn't find any tutorials for examples for that, and the internal architecture is not fully obvious to me at the moment.

  1. Do you think it makes sense to add the PH-tree to ELKI?
  2. How much effort would that be?
  3. Could I get some help?
  4. Does it make sense to implement only an in-memory version, as done for the kd-tree (as far as I understand)?

Some context: The PH-tree is a spatial index that was published at SIGMOD'14: paper, Java source code is available here. It is a bit similar to a quadtree, but much more space efficient, doesn't require rebalancing and scales quite well with dimensionality. What makes the PH-tree different from the R*-Tree implementations is that there is no concept of leaf/inner nodes, and nodes don't will not directly map to pages. It also works quite well with random insert/delete (no bulk-loading required).

Upvotes: 3

Views: 229

Answers (1)

Erich Schubert
Erich Schubert

Reputation: 8715

Yes.

Of course it would be nice to have a PH-tree in ELKI, to allow others to experiment with it. We want ELKI to become a comprehensive tool; it has R-trees, M-trees, k-d-trees, cover-trees, LSH, iDistance, inverted lists, space-filling-curves, PINN, ...; there are working-but-not-cleaned-up implementations of X-tree, rank-cover-trees, bond, and some more.

We want to enable researchers to study which index works best for their data easily, and of course it would be nice to have PH-tree, too. We also try to push the limits of these indexes, e.g. when supporting other distance measures than Euclidean distance.

The effort depends on how experienced you are with coding; ELKI uses some well-optimized data structures, but that means we are not using standard Java APIs in a number of places because of performance. Adding the cover tree took me about one day of work, for example (and it performed really nicely). I'd assume a more flexible (but also more memory intensive) k-d-tree would be a similar amount of work. I have not studied the PH-tree in detail, but I'd assume it is slightly more effort than that. My guts also say that it won't be as fast as advertised. It appears to be a prefix-compressed quadtree. In my experiments bit-interleaving approaches such as required for Hilbert curves can be surprisingly expensive. It also probably only works for Minkowski metrics. But you are welcome to prove me wrong. ;-)

You are always welcome to as for help at the mailing list, or here.

I would do an in-memory variant first, to fully understand the index. Then benchmark it to identify optimization potential, and debug it. Until then, you may not have figured out all the corner cases, such as duplicate points handling, degenerated data sets etc.

Always make on-disk optional. If your data fits into memory, a memory-only implementation will be substantially faster than any on-disk version.

When contributing to ELKI, please:

  • avoid external dependencies. We've had bad experience with the quality of e.g. Apache Commons, and we want to have the package easy to install and maintain, so we want to keep the .jar dependencies to a minimum (also, having tons of jars with redundant functionality comes at a performance cost). I'm inclined to only accept external dependencies for optional extension modules.
  • do not copy code from other sources. ELKI is AGPL-3 licensed, and any contribution to ELKI itself should be AGPL-3 licensed, too. In some cases it may be possible to include e.g. public domain code, but we need to keep these to a minimum. We could probably use Apache licensed code (in an external library), but shouldn't mix them. So from a quick look, you are not allowed to copy their source code into ELKI.

If you are looking for data mining project ideas, here is a list of articles/algorithms that we would love to see contributed to ELKI (we keep this list up to date for student implementation projects):

http://elki.dbs.ifi.lmu.de/wiki/ProjectIdeas

Upvotes: 1

Related Questions