Reputation: 1
I am trying to grasp best practices around representing data lineage as a graph (specifically, a DAG), and storing the values in something like neo4j.
For example, I have a multi-step processing pipeline - a recommendation engine with various input values, intermediate values, and a final output score. I'd like to represent the history of any given score as components of its previous values, each node representing a pure function.
The graph database would not be responsible for the calculations themselves, only for representing the inputs to each pure function represented by a node. Let's assume some nodes are computationally expensive to calculate, so persisting the intermediate values makes sense. Whenever a value in the graph changes, its child nodes (and their descendants) could be marked as stale for some process to recalculate them asynchronously.
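To make the staleness idea concrete, here is a minimal in-memory sketch of it, independent of any particular database. The node names and the `children` adjacency map are hypothetical; in practice the traversal would run against whatever store holds the dependencies.

```python
from collections import deque

# Hypothetical pipeline: each key is a node, each value lists its children.
children = {
    "user_features": ["similarity"],
    "item_features": ["similarity"],
    "similarity": ["score"],
    "score": [],
}

def mark_descendants_stale(changed, children, stale):
    """Breadth-first walk from `changed`, flagging every downstream node."""
    queue = deque(children.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in stale:
            continue  # already flagged via another path
        stale.add(node)
        queue.extend(children.get(node, []))
    return stale

stale = mark_descendants_stale("item_features", children, set())
print(sorted(stale))  # ['score', 'similarity']
```

An asynchronous worker could then pick nodes out of `stale` and recompute them once their parents are up to date.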
For those that have seen this architecture, what are some best practices surrounding this approach and is a graph database the right place to store these data dependencies?
Upvotes: 0
Views: 202
Reputation: 469
Yes, a graph database structurally mirrors a DAG and is a perfect place to store one as you described. As to best practices, it depends on which database. Neo4j is a directed property graph, so you can assign properties and values to both nodes and relationships ("vertices" and "edges" in graph-speak).
So for your pipeline, a good first bet is that each node represents a function, and can include such properties as function identifier and version, execution time, server config, etc.
Each relationship represents the message from one function to another. You have a choice here, and which way you go depends on the queries you need to run. Based on your description, I might keep the input and output parameter values on the function node. But you could also store them as the payload of the relationship. The latter is a truer representation of the pipeline, but it introduces an inconsistency if you need to store your final output: the last function has no outgoing relationship to carry it, so that one value would have to live on the node anyway.
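The two modelling options can be sketched in plain Python, without touching Neo4j. The `FunctionNode` and `Relationship` classes, the property names, and the values are all hypothetical and only stand in for nodes and relationships in a property graph.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionNode:
    name: str
    version: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    source: str
    target: str
    payload: dict = field(default_factory=dict)

# Option A: input and output values are properties of the function node.
score_a = FunctionNode("score", "v1", {"inputs": {"similarity": 0.8}, "output": 0.72})
edges_a = [Relationship("similarity", "score")]

# Option B: values travel as the payload of each relationship.
score_b = FunctionNode("score", "v1")
edges_b = [Relationship("similarity", "score", {"value": 0.8})]

# The final node has no outgoing relationship, so under option B its
# output has no edge to sit on and must be stored on the node instead.
outgoing = [e for e in edges_b if e.source == "score"]
print(len(outgoing))  # 0
```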
Upvotes: 1
Reputation: 1109
When I've had to roll my own for something like this, I add a dateUpdated column to my database to make it obvious when I updated. You could then walk down the node's children list to make sure that a child was always updated after its parent. I'm assuming:
then you could partition the graph into disjoint segments (once), and within each segment sweep all the roots into a pile and process each root and that root's children, updating the dateUpdated column as you update each node.
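A rough sketch of that partition-then-sweep idea follows, using an in-memory edge list rather than a database. The edge data and function names are hypothetical, "disjoint segments" is interpreted here as weakly connected components, and the sweep from the roots is done in topological order so a child is only touched after all of its parents.

```python
from collections import defaultdict, deque
from datetime import datetime, timezone

edges = [("a", "c"), ("b", "c"), ("c", "d"), ("x", "y")]  # hypothetical (parent, child)

def build(edges):
    children, parents, nodes = defaultdict(list), defaultdict(list), set()
    for p, c in edges:
        children[p].append(c)
        parents[c].append(p)
        nodes.update((p, c))
    return nodes, children, parents

def segments(nodes, children, parents):
    """Split the graph into weakly connected components."""
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(children[n] + parents[n])
        seen |= component
        result.append(component)
    return result

def process_segment(component, children, parents, date_updated):
    """Start from the roots; visit a child only after all its parents."""
    pending = {n: len(parents[n]) for n in component}
    queue = deque(sorted(n for n in component if pending[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        date_updated[n] = datetime.now(timezone.utc)  # recompute the node here
        order.append(n)
        for c in children[n]:
            pending[c] -= 1
            if pending[c] == 0:
                queue.append(c)
    return order

nodes, children, parents = build(edges)
date_updated = {}
orders = [process_segment(s, children, parents, date_updated)
          for s in segments(nodes, children, parents)]
print(orders)  # [['a', 'b', 'c', 'd'], ['x', 'y']]
```

Because the segments share no nodes, each one could be handed to a separate worker.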
If you want to keep the complete history, do an INSERT when you "update" dateUpdated (copy everything that doesn't change, set dateUpdated = NOW.) If you want to represent only the current processed state, then do an UPDATE.
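Here is a small sketch of the two write strategies. It uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table names, columns other than dateUpdated, and timestamps are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_history (node TEXT, value REAL, dateUpdated TEXT)")
conn.execute("CREATE TABLE node_current (node TEXT PRIMARY KEY, value REAL, dateUpdated TEXT)")

def record_with_history(conn, node, value, ts):
    """Complete history: every 'update' is a new row."""
    conn.execute("INSERT INTO node_history VALUES (?, ?, ?)", (node, value, ts))

def record_current_only(conn, node, value, ts):
    """Current state only: overwrite the row, creating it on first write."""
    cur = conn.execute(
        "UPDATE node_current SET value = ?, dateUpdated = ? WHERE node = ?",
        (value, ts, node),
    )
    if cur.rowcount == 0:
        conn.execute("INSERT INTO node_current VALUES (?, ?, ?)", (node, value, ts))

for value, ts in [(0.5, "2024-01-01T00:00:00"), (0.75, "2024-01-02T00:00:00")]:
    record_with_history(conn, "score", value, ts)
    record_current_only(conn, "score", value, ts)

history_rows = conn.execute("SELECT COUNT(*) FROM node_history").fetchone()[0]
current_rows = conn.execute("SELECT COUNT(*) FROM node_current").fetchone()[0]
latest = conn.execute(
    "SELECT value FROM node_history WHERE node = ? ORDER BY dateUpdated DESC LIMIT 1",
    ("score",),
).fetchone()[0]
print(history_rows, current_rows, latest)  # 2 1 0.75
```

With the history table, the current value is simply the row with the greatest dateUpdated for that node.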
One problem with doing an UPDATE is if you have other processes reading this data, you could have a race condition in which you update a parent, read an un-updated child, and get a stale value. The other approach is that you don't actually give the value back until all children have updated times greater than their parents (or whatever inputs are in your function.) If you do an INSERT, then the child nodes won't exist until they're processed, but you'll have to handle the case where you query a node that doesn't exist due to a rebuild, versus the case where you query a node that just doesn't exist.
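The "don't give the value back until it is fresh" check could look roughly like this. It is a sketch under assumptions: timestamps are plain integers for readability, the `parents` map and node names are hypothetical, and freshness is checked recursively so that a stale grandparent also blocks the read.

```python
parents = {"score": ["similarity"], "similarity": ["item_features"]}
date_updated = {"item_features": 3, "similarity": 2, "score": 4}
values = {"item_features": 1.0, "similarity": 0.8, "score": 0.72}

def is_fresh(node, parents, date_updated):
    """A node is fresh if it exists and is no older than any fresh parent."""
    if node not in date_updated:
        return False
    return all(
        is_fresh(p, parents, date_updated) and date_updated[p] <= date_updated[node]
        for p in parents.get(node, [])
    )

def read_if_fresh(node, parents, date_updated, values):
    """Return the value, or None while the node is still waiting on a rebuild."""
    return values[node] if is_fresh(node, parents, date_updated) else None

print(read_if_fresh("score", parents, date_updated, values))  # None

# Simulate the asynchronous rebuild catching up, parent first.
date_updated["similarity"] = 5
date_updated["score"] = 6
print(read_if_fresh("score", parents, date_updated, values))  # 0.72
```

Returning None for "not ready" is one option; a caller that needs to tell "being rebuilt" apart from "never existed" would need a distinct signal for each.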
If topology changes, then you'll likely need to enforce the DAG constraint, and then do the partitioning each time.
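One way to enforce the DAG constraint when the topology changes is to reject any new edge whose child can already reach its parent. The sketch below assumes an in-memory adjacency map; the function names are hypothetical.

```python
def would_create_cycle(children, parent, child):
    """True if `parent` is reachable from `child` (including parent == child)."""
    stack, seen = [child], set()
    while stack:
        n = stack.pop()
        if n == parent:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(children.get(n, []))
    return False

def add_edge(children, parent, child):
    """Add parent -> child, refusing edges that would break the DAG."""
    if would_create_cycle(children, parent, child):
        raise ValueError(f"edge {parent} -> {child} would create a cycle")
    children.setdefault(parent, []).append(child)
    children.setdefault(child, [])

graph = {}
add_edge(graph, "a", "b")
add_edge(graph, "b", "c")
try:
    add_edge(graph, "c", "a")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```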
So, I think this sort of depends on what kind of history you want to keep (complete history, or precomputed value history) and how your graph topology changes over time.
I'm pretty good at three of the four tags you added, but I know nothing about neo4j, so sorry if there's already a data structure for this there. It also always seems like there should be some database system that can store data as it's changed, at different revisions, but I always fall back on dateUpdated with MySQL...
Upvotes: 0