somal
somal

Reputation: 53

Pandas dynamically flatten a parent child hierarchy

I want dynamically flatten a parent child hierarchy pandas dataframe.

Notes:

Input example:

import pandas as pd
import numpy as np
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.expand_frame_repr = False
pd.options.mode.chained_assignment = None


df = pd.DataFrame(
    {
        "child": ["xyz", "opr", "axz", "asd", "asd", "zxc", "zxc", "zxc"],
        "parent": [np.nan, "xyz", "xyz", "opr", "opr", "opr", "axz", "xyz"],
    }
)

print(df)

Expected output:

  level_0 level_1 leaf
0     xyz     opr  asd
1     xyz     opr  asd
2     xyz     opr  zxc
3     xyz     axz  zxc
4     xyz     NaN  zxc

Upvotes: 1

Views: 2679

Answers (1)

Serge Ballesta
Serge Ballesta

Reputation: 148870

The leafs are the elements from the child column that do not exist in parent one.

Once this is done, I would iterate adding a new parent column on every pass until all parents are NaN. And another trick is required to ensure that the last level will contain the ultimate grand_parents: if a parent column contains a NaN, the values have to be switched with the previous column. Code:

result = df[~df['child'].isin(df['parent'])]
result.columns = ['leaf', 'lev_1']
ix = 1
while True:
    result = result.merge(df, 'left', left_on=f'lev_{ix}', right_on='child'
                          ).drop(columns='child')
    if (result['parent'].isna().all()):
        result = result.drop(columns='parent')
        break
    result.loc[result['parent'].isna(), f'lev_{ix}':'parent'
               ] = result[result['parent'].isna()][['parent', f'lev_{ix}']
                                                   ].values
    print(result)
    ix += 1
    result = result.rename(columns={'parent': f'lev_{ix}'})
# rename and reorder columns to match your expected result
result = result.rename(columns={f'lev_{ix-i}': f'lev_{i}' for i in range(ix)}
                       ).reindex(columns=[f'lev_{i}' for i in range(ix)]
                                 + ['leaf'])

It gives as expected:

          lev_0     lev_1     leaf
0  grand parent  parent 1  child 1
1  grand parent  parent 1  child 1
2  grand parent  parent 1  child 2
3  grand parent  parent 2  child 2
4  grand parent       NaN  child 2

and should accept an arbitrary number of levels

Upvotes: 1

Related Questions