Sid Kwakkel
Sid Kwakkel

Reputation: 799

Pandas: flattening a tree structure

I have a tree of categories represented by the following.

import pandas as pd

asset_tree = [
    {'id': 1, 'name': 'Linear Asset', 'parent_id': -1},
    {'id': 2, 'name': 'Lateral', 'parent_id': 1},
    {'id': 3, 'name': 'Main', 'parent_id': 1},
    {'id': 4, 'name': 'Point Asset', 'parent_id': -1},
    {'id': 5, 'name': 'Fountain', 'parent_id': 4},
    {'id': 6, 'name': 'Hydrant', 'parent_id': 4}
]
tree = pd.DataFrame(asset_tree)
print(tree)

This gives me a dataframe as follows:

   id          name  parent_id
0   1  Linear Asset         -1
1   2       Lateral          1
2   3          Main          1
3   4   Point Asset         -1
4   5      Fountain          4
5   6       Hydrant          4

The highest nodes in the tree have parent_id equal to -1, and so the tree can graphically be represented as follows:

Linear Asset
   | - Lateral
   | - Main
Point Asset
   | - Fountain
   | - Hydrant

I need to generate the following dataframe.

   id          name  parent_id  flat_name
0   1  Linear Asset         -1  Linear Asset
1   2       Lateral          1  Linear Asset : Lateral
2   3          Main          1  Linear Asset : Main
3   4   Point Asset         -1  Point Asset
4   5      Fountain          4  Point Asset : Fountain
5   6       Hydrant          4  Point Asset : Hydrant

The tree is generated dynamically and can have any number of levels and so the following tree

asset_tree = [
    {'id': 1, 'name': 'Linear Asset', 'parent_id': -1},
    {'id': 2, 'name': 'Lateral', 'parent_id': 1},
    {'id': 3, 'name': 'Main', 'parent_id': 1},
    {'id': 4, 'name': 'Point Asset', 'parent_id': -1},
    {'id': 5, 'name': 'Fountain', 'parent_id': 4},
    {'id': 6, 'name': 'Hydrant', 'parent_id': 4},
    {'id': 7, 'name': 'Steel', 'parent_id': 2},
    {'id': 8, 'name': 'Plastic', 'parent_id': 2},
    {'id': 9, 'name': 'Steel', 'parent_id': 3},
    {'id': 10, 'name': 'Plastic', 'parent_id': 3}
]

should result in the following:

   id          name  parent_id  flat_name
0   1  Linear Asset         -1  Linear Asset
1   2       Lateral          1  Linear Asset : Lateral
2   3          Main          1  Linear Asset : Main
3   4   Point Asset         -1  Point Asset
4   5      Fountain          4  Point Asset : Fountain
5   6       Hydrant          4  Point Asset : Hydrant
6   7         Steel          2  Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3  Linear Asset : Main : Steel
9  10       Plastic          3  Linear Asset : Main : Plastic

Upvotes: 3

Views: 1012

Answers (3)

Tom
Tom

Reputation: 8790

Here is a recursive apply function for accomplishing this. The function takes in an id and returns its "path" through the tree:

def flatname(ID):
    row = df[df['id'] == ID].squeeze()
    if row['parent_id'] == -1:
        return row['name']
    else:
        return flatname(row['parent_id']) + ' : ' + row['name']

To use, call:

df['flat_name'] = df['id'].apply(flatname)

The df after used on your second example:

   id          name  parent_id                         flat_name
0   1  Linear Asset         -1                      Linear Asset
1   2       Lateral          1            Linear Asset : Lateral
2   3          Main          1               Linear Asset : Main
3   4   Point Asset         -1                       Point Asset
4   5      Fountain          4            Point Asset : Fountain
5   6       Hydrant          4             Point Asset : Hydrant
6   7         Steel          2    Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3       Linear Asset : Main : Steel
9  10       Plastic          3     Linear Asset : Main : Plastic

OP noted that the above function refers explicitly to the df variable defined outside of the function-scope. So if you call your DataFrame something different, or you want to call this on many DataFrames, this could cause problems. One fix would be to turn the apply function into more of a private helper, and create an external (more user-friendly) function which calls it:

def _flatname_recurse(ID, df):
    row = df[df['id'] == ID].squeeze()
    if row['parent_id'] == -1:
        return row['name']
    else:
        return _flatname_recurse(row['parent_id'], df=df) + ' : ' + row['name']

# asset_df to specify we are looking for a specific kind of df
def flatnames(asset_df):
    return asset_df['id'].apply(_flatname_recurse, df=asset_df)

Then call with:

df['flat_name'] = flatnames(df)

Also, note that I used to use row = df.iloc[ID - 1, :] for identifying the row, which worked in this case but was dependent on the id being one greater than the index. This approach is more general.

Upvotes: 5

Ajax1234
Ajax1234

Reputation: 71451

You can use recursion to find the path to the parent id:

import pandas as pd
asset_tree = [{'id': 1, 'name': 'Linear Asset', 'parent_id': -1}, {'id': 2, 'name': 'Lateral', 'parent_id': 1}, {'id': 3, 'name': 'Main', 'parent_id': 1}, {'id': 4, 'name': 'Point Asset', 'parent_id': -1}, {'id': 5, 'name': 'Fountain', 'parent_id': 4}, {'id': 6, 'name': 'Hydrant', 'parent_id': 4}]
a_tree = {i['id']:i for i in asset_tree} #to dictionary for more efficient lookup
def get_parent(d, c = []):
   if (k:=a_tree.get(d['parent_id'])) is None:
      return c + [d['name']]
   return get_parent(k, c+[d['name']])

r = [{**i, 'flat_name':' : '.join(get_parent(i)[::-1])} for i in asset_tree]
df = pd.DataFrame(r)

Output:

    id         name  parent_id               flat_name
0   1  Linear Asset         -1            Linear Asset
1   2       Lateral          1  Linear Asset : Lateral
2   3          Main          1     Linear Asset : Main
3   4   Point Asset         -1             Point Asset
4   5      Fountain          4  Point Asset : Fountain
5   6       Hydrant          4   Point Asset : Hydrant

On your larger asset_tree:

asset_tree = [{'id': 1, 'name': 'Linear Asset', 'parent_id': -1}, {'id': 2, 'name': 'Lateral', 'parent_id': 1}, {'id': 3, 'name': 'Main', 'parent_id': 1}, {'id': 4, 'name': 'Point Asset', 'parent_id': -1}, {'id': 5, 'name': 'Fountain', 'parent_id': 4}, {'id': 6, 'name': 'Hydrant', 'parent_id': 4}, {'id': 7, 'name': 'Steel', 'parent_id': 2}, {'id': 8, 'name': 'Plastic', 'parent_id': 2}, {'id': 9, 'name': 'Steel', 'parent_id': 3}, {'id': 10, 'name': 'Plastic', 'parent_id': 3}]
a_tree = {i['id']:i for i in asset_tree}
r = [{**i, 'flat_name':' : '.join(get_parent(i)[::-1])} for i in asset_tree]
df = pd.DataFrame(r)

Output:

   id          name  parent_id                         flat_name
0   1  Linear Asset         -1                      Linear Asset
1   2       Lateral          1            Linear Asset : Lateral
2   3          Main          1               Linear Asset : Main
3   4   Point Asset         -1                       Point Asset
4   5      Fountain          4            Point Asset : Fountain
5   6       Hydrant          4             Point Asset : Hydrant
6   7         Steel          2    Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3       Linear Asset : Main : Steel
9  10       Plastic          3     Linear Asset : Main : Plastic

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150725

This is a network problem, try networkx:

import networkx as nx

# build the graph
G = nx.from_pandas_edgelist(tree, source='parent_id', target='id',
                            create_using=nx.DiGraph)

# map id to name
node_names = tree.set_index('id')['name'].to_dict()

# get path from root (-1) to the node
def get_path(node):
    # this is a tree, so exactly one simple path for each node
    for path in nx.simple_paths.all_simple_paths(G, -1, node):
        return ' : '.join(node_names.get(i) for i in path[1:])

tree['flat_name'] = tree['id'].apply(get_path)

Output:

   id          name  parent_id                         flat_name
0   1  Linear Asset         -1                      Linear Asset
1   2       Lateral          1            Linear Asset : Lateral
2   3          Main          1               Linear Asset : Main
3   4   Point Asset         -1                       Point Asset
4   5      Fountain          4            Point Asset : Fountain
5   6       Hydrant          4             Point Asset : Hydrant
6   7         Steel          2    Linear Asset : Lateral : Steel
7   8       Plastic          2  Linear Asset : Lateral : Plastic
8   9         Steel          3       Linear Asset : Main : Steel
9  10       Plastic          3     Linear Asset : Main : Plastic

Upvotes: 3

Related Questions