windoze
windoze

Reputation: 45

Persisting a hierarchical ordered list (flatfile/sql/nosql)

I want to store hierarchical ordered lists. One example would be nested todo lists. Another example would be XML. It would just be a tree where the children are in a order. For simplicity, entries are just strings of text.

The thing is that the list will be edited by the user, so it is important that the common operations are fast:

I can imagine how to do this in a data structure: entries are linked lists, if they contain children, they also point to the head of another linked list. There is a hash table linking entry id to the actual data.

However, I need to store the data, and I have no idea how to achieve this. I don't want to save the entire tree if only one element changes. What is the best way? Flat files/SQLs/NoSqls/voodoos?

Upvotes: 1

Views: 2988

Answers (2)

orangepips
orangepips

Reputation: 9971

Using a relational database is viable solution. For your needs - fast insert, update, delete - I'd use an Adjacency List with an additional customizations as such:

id 
parent_id
cardinality -- sort order for all nodes with the same parent_id
depth -- distance from the root node

Calculating cardinality and depth is either done with code or - preferably - a database trigger for any insert, delete or update. In addition, for retrieving an entire hierarchy with one SELECT statement, a hierarchy bridge table is called for:

id
descendent_id 

This table would also be populated via the same trigger mentioned above and serves as a means for retrieving all nodes above or beneath a given id.

See this question for additional detail around Adjacency List, Hierarchy Bridge and other approaches for storing hierarchical data in a relational database.

Finally to provide some additional clarification on the options you listed:

  • Flat Files: a combination of linked lists and memory mapped files would probably serve, but you're really just rolling your own at that point, where a SQL or NoSQL solution would probably do better.
  • SQL: this would be my approach - tooling is the best here for data manipulation, backup and recovery.
    • XML: this is also a possibility with a database, very vendor specific, you'll need to study the syntax for node insert, update and delete. Can be very fast if the database offers an XML data type.
  • NoSQL: if you're talking key-value storage, the typical approach for hierarchical data appears to be materialized path, but this would require recalculating the entire path for all affected nodes on change, which is likely slow. Instead consider the Java Content Repository (JCR) - Apache Jackrabbit is an implementation - entire API centered around representing hierarchical structured data and persisting it - perhaps too heavyweight for the problem you're trying to solve.
  • voodoo: um...

Update

If you implement all pieces from this answer, add is cheap, re-sort is small cost, move is expensive. Trade-off is fast hierarchy traversal reads - for instance find a node's complete ancestry in one operation. Specifically, adding a leaf is an O(1) operation. Re-sort means updating cardinality all peer nodes coming after the moved node. Move means update of (1) cardinality for source and destination peer nodes coming after, (2) moved - and descendant - node depth, and (3) removal and addition of ancestry to hierarchy bridge table.

However, go with an Adjancency List alone (i.e. id, parent_id) and write becomes cheap, reads for one level are cheap, but reads that traverse the hierarchy are expensive. The latter would then require using recursive SQL such Oracle's CONNECT BY or Common Table Expressions as found in SQL Server and other RDBMSs.

Upvotes: 1

9000
9000

Reputation: 40894

You store lists (or rather trees) and don't want to rewrite the entire tree once a small piece of it changes. From this I conclude the stuctures are huge and small changes happen relatively often.

Linked lists are all about pointer chasing, and pointers and what they reference are much like keys and values. You need to efficiently store key-value pairs. Order of items is preserved by the linked list structure.

Suppose that you use a typical key-value store, from xDBM or Berkeley DB to any of modern NoSQL offerings. Also you could take a compact SQL engine, e.g. sqlite. They typically use trees to index keys, so it takes O(logN) to access a key, or hash tables that take about as much or a bit less.

You haven't specified when you persist your data incrementally. If you only do it once in a while (not for every update), you'll need to effectively compare the database to your primary data structure. This will be relatively time-consuming because you'll need to traverse the entire tree and look each node ID in the database. This is logarithmic but with a huge constant because of necessary I/O. And then you'll want to clean you persistent store from items that are no longer referenced. It may happen that just dumping the tree as JSON is far more efficient. In fact, that's what many in-memory databases do.

If you update your persistent structure with every update to the main structure, there's no point to have that main structure anyway. It's better to replace it with an in-memory key-value store such as Redis which already has persistence mechanisms (and some other nice things).

Upvotes: 1

Related Questions