Reputation: 1918
I have a dataframe where some rows have a parent-child relationship. For example, 1002 is the parent of 1003, 1003 is the parent of 1004
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['Id', 'Parent Id', 'Child Id', 'A', 'B'],
data=[[1001, np.nan, 1005, 'A1001', 'B1001'],
[1002, np.nan, 1003, 'A1002', 'B1002'],
[1003, 1002, 1004, 'A1003', np.nan],
[1004, 1003, np.nan, 'A1004', np.nan],
[1005, 1001, np.nan, 'A1005', np.nan]
])
print(df)
Id Parent Id Child Id A B
0 1001 NaN 1005.0 A1001 B1001
1 1002 NaN 1003.0 A1002 B1002
2 1003 1002.0 1004.0 A1003 NaN
3 1004 1003.0 NaN A1004 NaN
4 1005 1001.0 NaN A1005 NaN
I want to merge the rows to keep only those rows without child item. I understand I can do it with
df = df[df['Child Id'].isnull()]
However, some information appear only in the parent row but missing from child rows, e.g. column 'B' from 1002. I want to inherit it to the latest child row.
Below is the expected output:
Id Parent Id Child Id A B
3 1004 1003.0 NaN A1004 B1002
4 1005 1001.0 NaN A1005 B1001
Instead of iterating the rows, are there built-in pandas method I can make use of?
Upvotes: 0
Views: 328
Reputation: 3495
You need to get the value in B
column down the chain of parent-child relationships. You can do it with:
for i in range(max_length_of_relationships):
df = df.merge(df[['Id', 'B']].rename({'Id': 'Parent Id', 'B': 'Parent B'}, axis=1), how='left')
df['B'] = df['B'].combine_first(df['Parent B'])
df.drop('Parent B', axis=1, inplace=True)
Where max_length_of_relationships
is the maximun number of parent-child links in one chain (2 in the df: 1. 10002-10003, 2.10003-10004), if you are not sure what the number is, use a large number that can not be surpassed.
And then leave only the rows without childs as you did with:
df = df[df['Child Id'].isnull()]
Upvotes: 1