Techie
Techie

Reputation: 45124

How To Develop A Database Schema For A Tree Structure(Directed acyclic graph)

I'm using below tree structure and planning to develop a db schema for the below.

enter image description here

What I have development up to now is below,

enter image description here

The problem I'm having is if I search for Y, below tree should be generated.

enter image description here

Logic I'm using is, Y has two cross references X,Z, those two nodes should be in the diagram and the parent nodes all the way to the starting parent node.

Given that I'm using PHP to generate this tree using a mysql db table as shown above. DB structure can be changed. I searched on google for a similar tree structure but I couldn't find any help.

Note

I'm not asking for you to write code for me. All I'm asking is some guidelines how this should be done.

I found below helpful but still different to my scenario

What is the most efficient/elegant way to parse a flat table into a tree?

How to represent a tree like structure in a db

If anyone can please tell me what php libraries should I use to generate the tree and what's the suitable db structure to be used?

Upvotes: 17

Views: 5172

Answers (6)

Dmitri Zaitsev
Dmitri Zaitsev

Reputation: 14046

Let me restate the question just to make sure I understand it right. You have Nodes and two kind of relations - arrows and vertical lines. Then given a Node N, you want to generate a subset S(N) of Nodes with the following recursive rules:

  • Rule 0: N is in S(N).
  • Rule 1: If Node M is connected with N via vertical relation, it is also in S(N).
  • Rule 2: If Node M is in S(N), then any Node with arrow pointing to M is also in S(N).

The set S(N) is the minimal set of Nodes satisfying those rules.

The example your give with N being Y, seem to confirm these rules.

However, there is another different but (at least to me) more natural set of Rules, where Rule 1 above is replaced by

  • Rule 1': If Node M is in S(N), then any Node connected with M via vertical relation is also in S(N).

In the sequel I will assume Rules 0,1,2 confirmed by your example, but similar approach can be used for Rules 0,1',2 or any modification.


Also I understand there is a mistake in your table in row 7, it should be:

7    B2    B    B1,B3 (not B2,B3)

Now to the proposed solution.

I would first slightly modify your table structure: Since "id" is your primary key, the rule for a foreign key is to point to the primary key of the related entry. That is, in your case, I would replace "node_id" by "node_name" or something like that just not to confuse with "id", and replace entries of "node_parent_id" and "cross_ref" by their "id"s. For instance, the row number 15 would look like that:

15    'Y'    [13]    [14,16]

Alternatively, if you prefer for readability reasons, you can use A, B, X, Y etc. as primary keys, provided they are unique, of course, then your table would remain the same, with exception of the "id" field that is not needed. I will assume the first case, but you can adapt it to the second with simple replacement.

That is all you need, as far as the table goes.


Now you need a recursive function to generate the subgraph S(N) for each given Node N.

I will implement the set S as array $set of all 'id's of its Nodes. Then I will define two functions - one to expand the set initially based on Rules 1,2 and the other to expand the set subsequently based on Rule 2 only.

For simplicity, I will assume your table is imported into memory as associative array $rows of rows, such that $rows[$id] represents the row with 'id' equal to $id as, again, associative array. So

$rows[15] = array('id'=>15, 
                  'node_name'=>'Y', 
                  'node_parent_id'=>array(13), 
                  'cross_ref'=>array(14,16)
);

Here is the code for the functions:

function initial_expand_set ($node_id) {
    global($rows); // to access the array from outside
    $set = array($node_id);    // Rule 0
    $row = $rows[$node_id];    // Get current Node

    $parents = $row['node_parent_id'];  // Get parents of the Node
    $set = $parents ? array_merge ($set, $parents) : $set;   // Join parents to the set

    $vert_brothers = $row['cross_ref'];  // Get vertical relations
    $set = $vert_brothers ? array_merge ($set, $vert_brothers) : $set;

    $set = expand_set($set);  // Recursive function defined below
    return $set;
}

And the recursive function:

// iterate over nodes inside the set, expand each node, and merge all together
function expand_set (array $set) {
    global($rows);
    $expansion = $set;  //  Initially $expansion is the same as $set
    foreach ($set as $node_id) {
        $row = $rows[$node_id];    // Get Node 

        // Get the set of parents of the Node - this is our new set
        $parents = $row['node_parent_id'];  // Get parents of the Node

        // apply the recursion
        $parents_expanded = expand_set($parents);

        // merge with previous expansion
        $expansion = array_merge($expansion, $parents_expanded);
    }
    // after the loop is finished getting rid of redundant entries
    // array_unique generates associative array, for which I only collect values
    $expansion = array_values( array_unique($expansion) );
    return $expansion;
}

Hope it works for you. :)

If any further details or explanations needed, I will be happy to assist.

PS. For the pedants among readers note that I use space before '(' for function definition and no space for function calls, as recommended by Douglas Crokford.

Upvotes: 5

Tom Drake
Tom Drake

Reputation: 537

If I understand you correctly, you have a classic many-to-many self-referential logical table structure. This is handled easily by creating three tables: One to represent your 'nodes', another to represent the parent-child 'associations' between the nodes', and another to represent the sibling relationships. I'm not convinced you need to represent sibling relationships directly, as these can be inferred from the parent-child relationships. However, since you have not rendered 'green' relationship lines for all siblings, I'll assume that this is a some 'special' relationship. The tables/columns can be modeled as follows:

Node

  • node_id (pk)
  • . . .

Node_Map

  • parent_id (fk to node.node_id)
  • child_id (fk to node.node_id)

node_sibling_map

  • node_id (fk to node.node_id)
  • sibling_id (fk to node.node_id)

In order to populate the table you've described in this model, you'd need to issue the following. (quotes omitted).

  • insert into node (node_id) values(A);
  • insert into node (node_id) values(B);
  • insert into node (node_id) values(C);
  • insert into node (node_id) values(A1);
  • insert into node (node_id) values(A2);
  • insert into node (node_id) values(B1);
  • insert into node (node_id) values(B2);
  • insert into node (node_id) values(B3);
  • insert into node (node_id) values(A2B1);
  • insert into node (node_id) values(CB3);
  • insert into node (node_id) values(A2B1B2);
  • insert into node (node_id) values(X);
  • insert into node (node_id) values(Y);
  • insert into node (node_id) values(Z);
  • insert into node_map (parent_id, child_id) values(A,A1);
  • insert into node_map (parent_id, child_id) values(A,A2);
  • insert into node_map (parent_id, child_id) values(B,B1);
  • insert into node_map (parent_id, child_id) values(B,B2);
  • insert into node_map (parent_id, child_id) values(B,B3);
  • insert into node_map (parent_id, child_id) values(C,CB3);
  • insert into node_map (parent_id, child_id) values(A2,A2B1);
  • insert into node_map (parent_id, child_id) values(B1,A2B1);
  • insert into node_map (parent_id, child_id) values(B2,A2B1B2);
  • insert into node_map (parent_id, child_id) values(B3,CB3);
  • insert into node_map (parent_id, child_id) values(C,CB3);
  • insert into node_map (parent_id, child_id) values(A2B1B2,X);
  • insert into node_map (parent_id, child_id) values(A2B1B2,Y);
  • insert into node_map (parent_id, child_id) values(A2B1B2,Z);
  • insert into node_sibling_map (node_id, sibling_id) values(B1,B2);
  • insert into node_sibling_map (node_id, sibling_id) values(B2,B3);
  • insert into node_sibling_map (node_id, sibling_id) values(X,Y);
  • insert into node_sibling_map (node_id, sibling_id) values(Y,Z);

Upvotes: 1

Zlatin Zlatev
Zlatin Zlatev

Reputation: 3089

It is quite a complex problem, the one you are dealing with. It may be worth checking the following articles:

http://www.codeproject.com/Articles/22824/A-Model-to-Represent-Directed-Acyclic-Graphs-DAG-o http://www.freepatentsonline.com/6633886.html

Upvotes: 0

UltraOne
UltraOne

Reputation: 166

The only PHP library I found for manipulating graphs is the PEAR “Structures_Graph” package ( http://pear.php.net/manual/en/package.structures.structures-graph.php ). It is not currently being maintained, important features (e.g. removing a node) are not implemented, and serious bugs are open (e.g. inability to install under Windows 7). It does not look like that package, in its current form, would be useful.

The basic operations need to manipulate your graph can be split up into those that change the graph (mutating) and those that query it (non-mutating).

Mutating operations

  • CreateNode($nodeName), returns a $nodeID – Note that when nodes are created, they have no edges connecting them to other nodes in the graph.
  • DeleteNode($nodeID) – To enforce referential integrity, this should only be allowed if all edges connecting to the node have previously been deleted, and an exception thrown otherwise.
  • UpdateNodeName($nodeID, $newNodeName) – If it is allowed to change the name of an existing node.
  • CreateHorizontalEdge($souceNodeID, $destinationNodeID) – This is a directed edge. Throws exceptions if edge already exists, or if adding edge wound create a cycle.
  • DeleteHorizontalEdge($sourceNodeID, $destinationNodeID)
  • CreateVerticalEdge($firstNodeID, $secondNodeID) – This is a bidirectional edge, the first and second node IDs can be switched and the effect on the graph will be the same. Throws an exception if edge already exists or if the two nodes do not have the same horizontal parent.
  • DeleteVerticalEdge($firstNodeID, $secondNodeID) – As edge is non-directed, it will delete edge even if the arguments were in the opposite order for creation.

Examples: To build the section of your graph that has node names starting with “B” manually, the code would be:

$nodeID_B = CreateNode(“B”);
$nodeID_B1 = CreateNode(“B1”);
$nodeID_B2 = CreateNode(“B2”);
$nodeID_B3 = CreateNode(“B3”);
CreateHorizontalEdge($nodeID_B, $nodeID_B1);
CreateHorizontalEdge($nodeID_B, $nodeID_B2);
CreateHorizontalEdge($nodeID_B, $nodeID_B3);
CreateVerticalEdge($nodeID_B1, $nodeID_B2);
CreateVerticalEdge($nodeID_B2, $nodeID_B3);

Code to manually remove the node with name “B3”:

// Must remove all edges that connect to node first
DeleteVerticalEdge($nodeID_B2, $nodeID_B3);
DeleteHorizontalEdge($nodeID_B, $nodeID_B3);
// Now no edges connect to the node, so it can be safely deleted
DeleteNode($nodeID_B3);

Non-mutating operations

  • NodeExists($nodeID) – Returns true/false
  • GetNodeIDByName($nodeName) – Returns $nodeID
  • GetNodeName($nodeID)
  • HorizontalEdgeExists($sourceNodeID, $destinationNodeID) – Returns true/false
  • VerticalEdgeExists($firstNodeID, $secondNodeID) – Returns true/false, same result regardless of argument order.
  • HorizontalConnectionExists($startNodeID, $endNodeID) – Returns true/false – Following the horizontal arrows, is there a path from the start node to the end node? To test if creating a new horizontal edge from $nodeID1 to $nodeID2 will create a cycle, call HorizontalConnectionExists($nodeID2, $nodeID1).
  • GetHorizontalAncestorNodes($nodeID) – Returns an array of all nodes that have horizontal paths leading from them to the argument node.
  • GetHorizontalDescendentNodes($nodeID) – Returns an array of all nodes that have horizontal paths leading from the argument node to the argument node.
  • GetVerticalSiblingNodes($nodeID) – Returns an array of all nodes that have vertical connections to the argument node. Since (as per your answer to my clarifying question), vertical relationships are not transitive, the function VerticalEdgeExists (above) is the only one needed to query for vertical relationships.

Example: To get a list of nodes to be included in the subtree you describe in your question, combine the results of GetHorizontalAncestorNodes($nodeID) and GetVerticalSiblingNodes($nodeID).

Data Structures

You will always need a “Nodes” table to hold nodeID and nodeName. This table can be extended to hold other information associated with the nodes.

Since vertical edges are not transitive, information about them can simply be put in a “VerticalEdges” table with columns vEdgeID, firstNodeID, secondNodeID.

There are several options for how to store the horizontal edge information. On the one hand, the data structures and mutating operations can be simple, but at the cost of making some of query operations slower and more complex. On the other hand, the data structures can be slightly more complex but possibly much larger (growing exponentially with number of nodes in the worse case), with more complex mutating operations, but simpler and faster queries. Deciding which implementation is best for you will depend on the size of your graphs and how often they change compared to the number of times they are queried.

I will describe three possible data structures to represent the graph, moving from simple to complex. I will go into detail on the algorithms for the operations listed above only for the last set of data structures. I think that set of structures is best for smaller graphs that have a high ratio of queries to changes.

Note that all data structures have the “Nodes” and “VerticalEdges” tables I discussed above.

Minimal Data Structure

The first data structure has a “HorizontalEdges” table with columns hEdgeID, sourceNodeID and destinationNodeID. The mutating functions are simple, and most of the code will be error checking code that throws exceptions. The non-mutating functions HorizontalConnectionExists, GetHorizontalAncestorNodes and GetHorizontalDescendentNodes will be complex and potentially slow. Each time they are called, they will recursively traverse the HorizontalEdges table and collect nodeIDs. Those collections are returned directly for the later two functions, while HorizontalConnectionExists can terminate and return true if it finds the end node while searching descendants of the start node. It will return false if the search finishes without finding the end node.

Transitive Closure table

The second data structure also has a HorizontalEdges table identical to the one described above, but also has a second table “HorizontalTransitiveClosures” with columns hTCID, startNodeID and endNodeID. There is a row in this table for each pair of start node and end node such that a path using horizontal edges can be traced from the start node to the end node.

Example: For the graph in the question, the rows in this table that include node A (to simplify notation, I will use the names, rather than integer node IDs to identify nodes, and omit the hTCID column) are:

A, A2
A, A2B1
A, A2B1B2
A, X
A, Y
A, Z

Rows that include node A2B1 (the first one is also in the set above) are:

A, A2B1
A2, A2B1
B, A2B1
B1, A2B1
A2B1, A2B1B2
A2B1, X
A2B1, Y
A2B1, Z

In a worse case scenario, this table scales as the square of the number of nodes.

With this data structure, HorizontalConnectionExists, GetHorizontalAncestorNodes and GetHorizontalDescendentNodes can be implemented as simple searches of the HorizontalTransitiveClosures table. The complexity is moved to the CreateHorizontalEdge and DeleteHorizontalEdge functions. DeleteHorizontalEdge is particularly complex and requires a fair bit of explanation as to why the algorithm works.

Transitive Closure with Paths

The final data structure I will discuss stores the horizontal edge information in two tables. The first, “HorizontalTransitiveClosurePaths” has columns hTCPathID, startNodeID, endNodeID, pathLength. The second table “PathLinks” has columns PathLinkID, hTCPathID, sourceNodeID, destinationNodeID.

The HorizontalTransitiveClosurePaths table is similar to the HorizontalTransitiveClosures table in the data structure described above, but it has one row for each possible path that can accomplish a transitive closure, rather than one row per transitive closure. For example, in the graph shown in the question, the HorizontalTransitiveClosures table would have one row (B, A2B1B2) (shorthand notation as above) for the closure that started at B and ended A2B1B2. The HorizontalTransitiveClosurePaths table would have two rows: (10, B, A2B1B2, 3) and (11, B, A2B1B2, 2), since there are two ways to get from B to A2B1B2. The PathLinks table describes each of those paths with one row per edge making up the path. For the two paths from B to A2B1B2, the rows are:

101, 10, B, B1
102, 10, B1, A2B1
103, 10, A2B1, A2B1B2
104, 11, B, B2
105, 11, B2, A2B1B2

There is no need for a HorizonalEdges table, since the edges can be found by selecting the rows in the HorizontalTransitiveClosurePaths table with a length of 1.

The query functions are implemented the same way as in the Transitive Closure data structure described above. Since multiple paths may exist for a closure, the GROUP BY operator will need to be used. For example, the SQL query that returns all nodes that are ancestors of the node with ID :nodeid is: SELECT startNodeID from HorizontalTransitiveClosurePaths WHERE endNodeID = :nodeid GROUP BY startNodeID

To implement DeleteHorizontalEdge, search the PathLinks for the hTCPathID of all paths that contain the edge. Then delete those paths from the HorizontalTransitiveClosurePaths table and all edges associated with the paths from the PathLinks table.

To implement CreateHorizontalEdge($souceNodeID, $destinationNodeID), first search HorizontalTransitiveClosurePaths table for paths that end at $souceNodeID – this is the “ancestor path set”. Search the HorizontalTransitiveClosurePaths for paths that start at destinationNodeID – this is the “descendent path set”. Now insert new paths from the following 4 groups (some of which may be empty) into the HorizontalTransitiveClosurePaths table and insert the links for those paths in the PathLinks table.

  1. The length 1 path from $souceNodeID to $destinationNodeID
  2. For each path in the ancestor path set, a new path that extends the end of the path by one additional link from $souceNodeID to $destinationNodeID
  3. For each path in the descendant path set, a new path that extends the start of the path by one additional link from $souceNodeID to $destinationNodeID
  4. For each combination of one path from the ancestor path set and one path from the descendant path set, a path created by slicing the ancestor path, to the link from $souceNodeID to $destinationNodeID, to the descendant path.

Example: A graph consists of 6 nodes: A1, A2, B, C, D1 and D2. It has 4 edges, (A1, B), (A2, B), (C, D1), (C, D2). The HorizontalTransitiveClosurePaths table (using node name rather than a number) is:

1, A1, B, 1
2, A2, B, 1
3, C, D1, 1
4, C, D2, 1

The PathLinks table is:

1, 1, A1, B
2, 2, A2, B
3, 3, C, D1
4, 4, C, D2

Now we add the edge from B to C. The ancestor path set is (1, 2) and the descendant path set is (3, 4) The paths added in each of the 4 groups are:

  1. The new edge itself: HorizontalTransitiveClosurePaths: (5, B, C, 1); PathLinks (5, 5, B, C)
  2. Extend each ancestor path by one link at the end:
    HorizontalTransitiveClosurePaths: 
        6, A1, C, 2
        7, A2, C, 2
    
    PathLinks:
        6, 6, A1, B
        7, 6, B, C
        8, 7, A2, B
        9, 7, B, C
    
  3. Extend each descendent path by one link at the start:
    HorizontalTransitiveClosurePaths:
        8, B, D1, 2
        9, B, D2, 2
    
    PathLinks:
        10, 8, B, C
        11, 8, C, D1
        12, 9, B, C
        13, 9, C, D2
    
  4. Splice together all combinations containing one ancestor path and one descendant path:
    HorizontalTransitiveClosurePaths:
        10, A1, D1, 3
        11, A1, D2, 3
        12, A2, D1, 3
        13, A2, D2, 3
    
    PathLinks:
        14, 10, A1, B
        15, 10, B, C
        16, 10, C, D1
        17, 11, A1, B
        18, 11, B, C  
        19, 11, C, D2
        20, 12, A2, B
        21, 12, B, C
        22, 12, C, D1
        23, 13, A2, B
        24, 13, B, C
        25, 13, C, D2
    

Let me know if any part of the answer needs additional clarification.

Upvotes: 2

Moein Hosseini
Moein Hosseini

Reputation: 4383

Your db structure doesn't seem to be a tree,it's just graph.

I suggest you to throw out relational database for this structure and take look at some of Graph Database like Neo4j ,OrientDB and Infinite Graph.

But if your forced to use MySQL you can use FlockDB which can be used to traverse MySQL Node(see row as node) but with some limitation. or you can test another MySQL engine like OQGRAPH which provide graph engine for MySQL and MariaDB.

Upvotes: 5

leftclickben
leftclickben

Reputation: 4614

Your database structure is not normalised, because you have multiple ids in both node_parent_id and cross_refer. You should separate this information out into separate tables.

So, you would have your nodes table, plus a second table to describe the parent-child relationships; this table would have a child node id and a parent node id.

The cross-references should be in a third table, which again has two node id columns, but there are two ways you can do this, because the cross-references are bi-directional. One way is to store each cross-reference only once, which means when you query the table, you have to check both possibilities (a cross reference between X and Y could be stored with X in the first column and Y in the second column, or the other way around, so to find X you would have to check both columns). The other way is to store each cross-reference twice, once for each direction. This makes querying simpler but it is storing redundant data and can lead to inconsistencies, e.g. if for some reason one reference is deleted but the other is not.

Finding paths with this structure becomes a much simpler matter because you don't have to additionally parse comma-separated strings, which is both more complex and less efficient.

You can also use this to ensure referential integrity, so that, for example, a node doesn't have a parent id that doesn't actually exist in the database.

For more information, research "database normalisation". (Or you can spell it with a "z" if you are so inclined ;-P)

Upvotes: 3

Related Questions