Reputation: 511
I have a main dataframe (main_df) like:
A B X Y Id1
0 cat cat1 catabc 0.1 uuid01
1 cat cat1 catxyz 0.4 uuid02
2 cat cat2 catpqr 0.5 uuid01
3 dog dog1 dogxyz 0.3 uuid03
4 dog dog2 dogpqr 0.2 uuid02
5 dog dog2 dogabc 0.8 uuid01
And another dataframe (map_df) which will help with the mapping:
A B X Y Id2
0 cat cat1 catxyz 0.4 nx01
1 cat cat1 NaN NaN nx02
2 cat NaN NaN NaN nx03
3 dog dog1 dogxyz NaN nx04
4 dog NaN NaN NaN nx05
The priority of mapping for the columns will be in this order: ['A', 'B', 'X', 'Y'].
So, if all cells of a row of map_df match any row of main_df, then the corresponding element of Id2 should be added to main_df. If Y is not present, the mapping should proceed with ['A', 'B', 'X']; if X is also absent, it should proceed with ['A', 'B'], and so on.
I was able to achieve a partial result with a dataframe merge. For example:
main_df = main_df.merge(map_df, how='left', on=['A', 'B', 'X', 'Y'])
But I am unable to figure out the priority-based angle here.
As a result of the mapping, the dataframe (result_df) should look like:
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogpqr 0.2 uuid02 nx05
5 dog dog2 dogabc 0.8 uuid01 nx05
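In code terms, the behaviour I am after is roughly the loop below (assuming at most one map_df row per key combination at each level; it is not tuned for speed):

import pandas as pd

keys = ['A', 'B', 'X', 'Y']
result = main_df.copy()
result['Id2'] = pd.NA

for i in range(len(keys), 0, -1):                         # longest key first
    # map rows defined on exactly keys[:i]: those columns present, the rest NaN
    level = map_df[map_df[keys[:i]].notna().all(axis=1)
                   & map_df[keys[i:]].isna().all(axis=1)]
    hit = result.merge(level[keys[:i] + ['Id2']], how='left',
                       on=keys[:i], suffixes=('', '_lvl'))
    hit.index = result.index                              # left merge keeps row order
    result['Id2'] = result['Id2'].fillna(hit['Id2_lvl'])  # fill only rows still unmatched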
Upvotes: 4
Views: 283
Reputation: 31166
This uses concepts from how an SQL engine handles a complex query: generate a partial Cartesian product, then filter the results down to the required set.
In this case the filter is a calculated match weight:
result_df = (main_df
    # minimum match - just A, effectively a partial Cartesian product
    .merge(map_df, on=["A"], suffixes=("","_m"))
    # calculate weight of match: NaN is neutral, different penalises, same rewards
    .assign(
        B_w=lambda dfa: np.where(dfa["B_m"].isna(), 0, np.where(dfa["B"]==dfa["B_m"],1,-1)*1000),
        X_w=lambda dfa: np.where(dfa["X_m"].isna(), 0, np.where(dfa["X"]==dfa["X_m"],1,-1)*100),
        Y_w=lambda dfa: np.where(dfa["Y_m"].isna(), 0, np.where(dfa["Y"]==dfa["Y_m"],1,-1)*10),
        w=lambda dfa: dfa.loc[:,["B_w","X_w","Y_w"]].sum(axis=1)
    )
    # the biggest weight is the one we want
    .sort_values(["A","B","X","Y","w"], ascending=[True,True,True,True,False])
    .groupby(["A","B","X","Y"]).first()
    # cleanup and don't show workings...
    .reset_index()
    .loc[:,["A","B","X","Y","Id1","Id2"]]
)
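The powers-of-ten weights make the priority strict: a reward or penalty on B (±1000) always outweighs anything X and Y can contribute together (at most ±110), and X (±100) likewise outweighs Y (±10), so the row with the highest w is the best match in the required ['A', 'B', 'X', 'Y'] priority order.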
An alternative, shown below, is to pick the longest available key instead of a weight. Here is a fully reproducible version (data setup, then the longest-key variant):
import io
import numpy as np
import pandas as pd

main_df = pd.read_csv(io.StringIO(""" A B X Y Id1
0 cat cat1 catabc 0.1 uuid01
1 cat cat1 catxyz 0.4 uuid02
2 cat cat2 catpqr 0.5 uuid01
3 dog dog1 dogxyz 0.3 uuid03
4 dog dog2 dogpqr 0.2 uuid02
5 dog dog2 dogabc 0.8 uuid01"""), sep=r"\s+")
map_df = pd.read_csv(io.StringIO(""" A B X Y Id2
0 cat cat1 catxyz 0.4 nx01
1 cat cat1 NaN NaN nx02
2 cat NaN NaN NaN nx03
3 dog dog1 dogxyz NaN nx04
4 dog NaN NaN NaN nx05"""), sep=r"\s+")
result_df = (main_df
    # minimum match - just A, effectively a partial Cartesian product
    .merge(map_df, on=["A"], suffixes=("","_m"))
    # simpler to use an empty string for NaN
    .fillna("")
    # keep only candidates whose non-empty key parts actually match the row
    .loc[lambda dfa: ((dfa["B_m"]=="") | (dfa["B"]==dfa["B_m"]))
                   & ((dfa["X_m"]=="") | (dfa["X"]==dfa["X_m"]))
                   & ((dfa["Y_m"]=="") | (dfa["Y"]==dfa["Y_m"]))]
    # synthetic column made of the rest of the key, and its length
    .assign(c=lambda dfa: dfa["B_m"]+dfa["X_m"]+dfa["Y_m"].astype(str),
            l=lambda dfa: dfa["c"].str.len())
    # the longest synthetic key is the one we want
    .sort_values(["A","B","X","Y","l"], ascending=[True,True,True,True,False])
    .groupby(["A","B","X","Y"]).first()
    # cleanup and don't show workings...
    .reset_index()
    .loc[:,["A","B","X","Y","Id1","Id2"]]
)
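Output: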
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogabc 0.8 uuid01 nx05
5 dog dog2 dogpqr 0.2 uuid02 nx05
Upvotes: 1
Reputation: 8391
This would solve it, though it is not very pretty.
It is an iterative algorithm that traverses your data first in a forward and then in a backward direction.
The forward pass resolves the merges with progressively lower priority, eliminating the matches from the data structure. It uses an inner merge strategy.
The backward pass updates the output from lowest to highest priority.
The final update call is a bit slow, I noticed.
merge_on = ['A', 'B', 'X', 'Y']
tot_cols = len(merge_on)
operation_main_df = main_df.copy()
outputs = []
# forward pass on progressively smaller key sets
for first_idx in range(len(merge_on)):
    merged_id2 = operation_main_df.merge(map_df, how='inner',
                                         on=merge_on[0:tot_cols - first_idx],
                                         right_index=True, suffixes=("", "_y"))
    # the relevant output has the right number of NaN columns, the rest is garbage
    outputs.append(merged_id2[merged_id2.isna().sum(axis=1) == first_idx])
# backward updating pass, from lowest to highest priority
reverse_it = iter(outputs[::-1])
out = next(reverse_it)[['A', 'B', 'X', 'Y', 'Id1', 'Id2']].copy()
for el in reverse_it:
    # update() aligns on the index, so higher-priority matches overwrite lower-priority ones
    out.update(el)
Output:
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogpqr 0.2 uuid02 nx05
5 dog dog2 dogabc 0.8 uuid01 nx05
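For reference, the backward pass works because DataFrame.update aligns on the index and overwrites matching cells in place, so a higher-priority output overwrites exactly the rows it matched. A minimal sketch with made-up frames:

import pandas as pd

# lower-priority result: every row got some Id2
out = pd.DataFrame({"Id2": ["nx05", "nx05"]}, index=[3, 4])
# higher-priority result: only row 3 was matched
better = pd.DataFrame({"Id2": ["nx04"]}, index=[3])

out.update(better)   # aligned on the index, only row 3 is overwritten
print(out)           # row 3 -> nx04, row 4 stays nx05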
EDIT: this accelerates it by roughly a 10x factor on my machine:
# change
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']]
# into
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']].copy()
Upvotes: 1