Reputation: 1854
Since I'm using PostgreSQL there is a module which is called ltree, which satisfies at least one of my needs, performance (I don't know about scalability? Someone says materialized path trees does not scale well..).
Since the application I'm developing is a CMS built entirely around a big tree, nodes, subtrees etc performance in queuering these nodes is absolutely essential, but since it's a hiearchical large (as it grows) tree you're working on and manipulating from the GUI (CRUD), I also want to make it possible for users to drag and drop to reorder nodes, subtrees etc while updating the tree (child records) in the database correctly.
As I understand moving and reordering nodes/subtrees in a tree is not really what ltree/materialized path trees are good for, so what I hope you can help be with is to either point me to the correct tree-structure-model that is best for performance AND moving subtrees and nodes, or perhaps... if ltree is indeed not a leftover from the past but worth still using, how could you achieve this with PostgreSQL's ltree module? And why/why not use ltree in this case?
Requirements:
I'm also considering the Closure tables aka Bridge tables (alot!), Nested Intervals (not sure I understand exactly how to implement it, and no good examples or gists currently exists?) or B-tree models. I'm just not quite sure yet, how these will satisfy my above 4 requirements. Re-organizing subtrees and nodes in nested intervals seems straightforward and performance seems good.. Quite hard to choose the right one to go with.
Since I definitely need performance (query / read performance), scalability, sorting I kinda thought that Closure tables WITH sort order could be very close, but I just cant imagine how big the closure tables and disk-space-overhead will become as my tree and nodes grow large. Closure tables and scalability, I'm just not too sure of. Am I wrong in worrying about this, and what might the best solution for this task be?
Upvotes: 3
Views: 6660
Reputation: 78561
The typical data structures used to index trees stored in SQL are designed and optimized for read performance on sets that don't change often.
As an example, if you're using the nested set model, adding or deleting a node would involve updating the entire tree (which typically means rewriting the entire table): great for reads, not so great for writes.
When write performance is important for you, you'll usually be better off working on the raw (id, parent_id)
tuples with recursive queries, while setting tree indexes you know for sure are dirty to null as you go. In those areas of the app where read-performance is more important, do a sanity check by checking for null values in the tree index, and re-index the tree as needed before actually using it. That way, you'll avoid incessant rewrites of your tree, and instead re-index it only when needed for a read.
An alternative albeit (much) more difficult approach is to use a variation of e.g. nested sets or nested intervals, but using reals or floats instead of integers. This allows to insert, move and delete nodes for free, at the cost of some storage and arithmetic/read overhead and the loss of some properties such as child node counts in the case of nested sets. However, it also requires that you keep an eye out for pathological edge-cases. Namely you'll need to periodically -- and sometimes preemptively -- "garbage collect" and re-index large enough chunks of the tree's index in order to fit new nodes when you run into the floating point type's precision limits.
(A variation of the latter is to use a numeric without any precision in order to try to dodge the problem. But it's actually kicking the can down the road, in the sense that you'll still be limited by Postgres internals of a few thousand digits of precision. And the storage and arithmetic overheads became material compared to just using a floating point type long before you run into that limit in my own tests from a few years back.)
As for a "The Best" structure or approach, there really is no magic bullet... Each has pros and cons based on the use-case (frequency of reads vs writes) and the size of the set. There's plenty of literature on the web that compare and explain each of them, which I'm sure you've found already.
That being said, for a CMS I'd advise that you go with whichever method you're most comfortable with. Either re-index the tree on the fly as writes occur, or mark the tree as dirty on writes and then re-indexing it on demand. The point here is that, if re-indexing is done right (= using a plpgsql function or equivalent, rather than a gazillion queries issued by your app), re-indexing an entire tree of a few hundred thousand nodes will a few hundred milliseconds at most. Assuming the tree isn't constantly getting updated, that's a perfectly acceptable overhead for end-users.
Upvotes: 5