Takver

Reputation: 182

Automatically identify parent/child connections from data that isn't already in tree format

I'm not sure whether this is possible, and I'm having a hard time figuring out where to start reading to find out.

I have a large amount of data like below:

          0    1   2    3   4
      xyres zres fms flts pts
11020     1    1   0    2   0
11105     1    1   1    0   5
10005     1    0   0    0   5
01106     0    1   1    0   6
01001     0    1   0    0   1
10121     1    0   1    2   1
00016     0    0   0    1   6
01127     0    1   1    2   7
01010     0    1   0    1   0
10001     1    0   0    0   1

I'd like to convert it to a tree structure like the one below, where two nodes share a parent whenever all the variables to their left have the same values.

xyres zres   fms  flts  pts


        ______0     ____6
       |      |____|
 ______0           1
|                  
|              ____0
|             |    |____1
0       ______0
|      |      |     ____1
|      |      |    |
|      |      |____1    
|______|           
       1       ____0
       |______|    |____6
              1
              |____
                   2
                   |____7


               ____0
              |    |____
        ______0         1
       |
 ______0
|      |______
1             1...etc.
|______
       1 .....etc.

Is it possible to do this automatically, so that I can obtain data in a tree structure that I can then use with packages like networkx or pygraphviz? Alternatively, any tips for basic introductory reading on creating tree data structures, for someone without any formal programming background? What I've found so far all assumes that you already have data in the correct format and is about manipulating it, not about creating it from scratch.

Upvotes: 0

Views: 168

Answers (1)

andersource

Reputation: 829

You can try:

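import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

G = nx.Graph()

# Assumes data.csv holds only the level columns (xyres, zres, fms, flts, pts)
df = pd.read_csv('data.csv')
# Grouping by every column leaves one index entry (a tuple) per distinct row
keys = list(df.groupby(list(df.columns)).count().index)

def key2id(key):
    # Node id for a prefix of a row, e.g. (0, 1, 1) -> '0-1-1'
    return '-'.join(map(str, key))

for key in keys:
    prev = None
    # Walk the prefixes of the row from shortest to longest
    for i in range(1, len(key) + 1):
        k = key2id(key[:i])
        G.add_node(k)
        if prev is not None:
            # Link each prefix to the next longer one
            G.add_edge(prev, k)
        prev = k

nx.draw(G, with_labels=True)
plt.show()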

Output: [image of the drawn graph]

Short explanation: first we group by all the columns to eliminate duplicate rows. Each remaining row represents a leaf node; we iterate over the leaves and, for each one, add a node for every prefix of the row (these are the intermediate nodes), linking each prefix to the next longer one with an edge.
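If it helps to see the same prefix idea without pandas or networkx, here is a minimal sketch that nests the sample rows from the question into plain dictionaries and then flattens them into parent/child pairs. The names build_nested_tree and iter_edges and the single "root" node are my own assumptions for illustration; they are not part of the code above or of any library.

```python
# Sample rows copied from the question: (xyres, zres, fms, flts, pts)
ROWS = [
    (1, 1, 0, 2, 0),
    (1, 1, 1, 0, 5),
    (1, 0, 0, 0, 5),
    (0, 1, 1, 0, 6),
    (0, 1, 0, 0, 1),
    (1, 0, 1, 2, 1),
    (0, 0, 0, 1, 6),
    (0, 1, 1, 2, 7),
    (0, 1, 0, 1, 0),
    (1, 0, 0, 0, 1),
]


def build_nested_tree(rows):
    """Nest rows into dicts, one level per column (hypothetical helper)."""
    tree = {}
    for row in rows:
        node = tree
        for value in row:
            # Reuse the child if this prefix was seen before, else create it
            node = node.setdefault(value, {})
    return tree


def iter_edges(tree, prefix=()):
    """Yield (parent_id, child_id) pairs; ids are prefixes such as '0-1-1'.

    The top-level values hang off an assumed node called 'root'.
    """
    parent_id = "-".join(map(str, prefix)) if prefix else "root"
    for value, subtree in tree.items():
        path = prefix + (value,)
        yield parent_id, "-".join(map(str, path))
        yield from iter_edges(subtree, path)


tree = build_nested_tree(ROWS)
edges = list(iter_edges(tree))
```

Duplicate rows collapse on their own here, because setdefault reuses an existing branch. The resulting edges list should be usable with networkx through G.add_edges_from(edges), for example on an nx.DiGraph() if you want the parent-to-child direction kept.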

Upvotes: 1
