henrywongkk
henrywongkk

Reputation: 1918

Merge multiple rows from same dataframe based on parent id

I have a dataframe where some rows have a parent-child relationship. For example, 1002 is the parent of 1003, 1003 is the parent of 1004

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(columns=['Id', 'Parent Id', 'Child Id', 'A', 'B'],
                      data=[[1001, np.nan, 1005, 'A1001', 'B1001'],
                            [1002, np.nan, 1003, 'A1002', 'B1002'],
                            [1003, 1002, 1004, 'A1003', np.nan],
                            [1004, 1003, np.nan, 'A1004', np.nan],
                            [1005, 1001, np.nan, 'A1005', np.nan]
                            ])
    print(df)

         Id  Parent Id  Child Id      A      B
    0  1001        NaN    1005.0  A1001  B1001
    1  1002        NaN    1003.0  A1002  B1002
    2  1003     1002.0    1004.0  A1003    NaN
    3  1004     1003.0       NaN  A1004    NaN
    4  1005     1001.0       NaN  A1005    NaN

I want to merge the rows to keep only those rows without child item. I understand I can do it with

    df = df[df['Child Id'].isnull()]

However, some information appear only in the parent row but missing from child rows, e.g. column 'B' from 1002. I want to inherit it to the latest child row.

Below is the expected output:

        Id  Parent Id  Child Id      A      B
    3  1004     1003.0       NaN  A1004  B1002
    4  1005     1001.0       NaN  A1005  B1001

Instead of iterating the rows, are there built-in pandas method I can make use of?

Upvotes: 0

Views: 328

Answers (1)

Aryerez
Aryerez

Reputation: 3495

You need to get the value in B column down the chain of parent-child relationships. You can do it with:

for i in range(max_length_of_relationships):
    df = df.merge(df[['Id', 'B']].rename({'Id': 'Parent Id', 'B': 'Parent B'}, axis=1), how='left')
    df['B'] = df['B'].combine_first(df['Parent B'])
    df.drop('Parent B', axis=1, inplace=True)

Where max_length_of_relationships is the maximun number of parent-child links in one chain (2 in the df: 1. 10002-10003, 2.10003-10004), if you are not sure what the number is, use a large number that can not be surpassed. And then leave only the rows without childs as you did with:

df = df[df['Child Id'].isnull()]

Upvotes: 1

Related Questions