Pandas df loop + merge

Question

Hello guys I need your wisdom,

I'm still new to python and pandas and I'm looking to achieve the following thing.

df = pd.DataFrame({'code': [125, 265, 128,368,4682,12,26,12,36,46,1,2,1,3,6], 'parent': [12,26,12,36,46,1,2,1,3,6,'a','b','a','c','f'], 'name':['unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','g1','g2','g1','g3','g6']})

ds = pd.DataFrame({'code': [125, 265, 128,368,4682], 'name': ['Eagle','Cat','Koala','Panther','Dophin']})

I would like to add a new column in the ds dataframe with the name of the highest parent.

as an example for the first row :

code | name | category
125 | Eagle | a

"a" is the result of a loop between df.code and df.parent 125 > 12 > 1 > a

Since the last parent is not a number but a letter i think I must use a regex and than .merge from pandas to populate the ds['category'] column. Also maybe use an apply function but it seems a little bit above my current knowledge.

Could anyone help me with this?

Regards,

Stef · Accepted Answer

The following is certainly not the fastest solution but it works if your dataframes are not too big. First create a dictionary from the parent codes of df and then apply this dict recursively until you come to an end.

p = df[['code','parent']].set_index('code').to_dict()['parent']

def get_parent(code):
    while par := p.get(code):
        code = par
    return code

ds['category'] = ds.code.apply(get_parent)

Result:

   code     name category
0   125    Eagle        a
1   265      Cat        b
2   128    Koala        a
3   368  Panther        c
4  4682   Dophin        f

PS: get_parent uses an assignment expression (Python >= 3.8), for older versions of Python you could use:

def get_parent(code):
    while True:
        par = p.get(code)
        if par:
            code = par
        else:
            return code

Pandas df loop + merge

Answers (1)

Related Questions