Asif Iqbal

Reputation: 511

Map Dataframe Columns Based on Priority

I have a main dataframe (main_df) like:

     A     B       X    Y     Id1
0  cat  cat1  catabc  0.1  uuid01
1  cat  cat1  catxyz  0.4  uuid02
2  cat  cat2  catpqr  0.5  uuid01
3  dog  dog1  dogxyz  0.3  uuid03
4  dog  dog2  dogpqr  0.2  uuid02
5  dog  dog2  dogabc  0.8  uuid01

And another dataframe (map_df) which will help with the mapping:

     A     B       X    Y   Id2
0  cat  cat1  catxyz  0.4  nx01
1  cat  cat1     NaN  NaN  nx02
2  cat   NaN     NaN  NaN  nx03
3  dog  dog1  dogxyz  NaN  nx04
4  dog   NaN     NaN  NaN  nx05

The priority of mapping for columns will be in the order: ['A', 'B', 'X', 'Y'].

So, if all cells of a row of map_df match any row of main_df, then the corresponding element of Id2 should be added to main_df. If Y is not present in the map_df row, the mapping should proceed with ['A', 'B', 'X']; if X is also absent, with ['A', 'B']; and so on.

I was able to achieve a partial result with a dataframe merge. For example:

main_df = main_df.merge(map_df, how='left', on=['A', 'B', 'X', 'Y'])

But I am unable to figure out the priority-based angle here.
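
Conceptually, the behaviour I'm after is something like the loop below (just a rough sketch to illustrate the fallback; it assumes plain pandas, a default RangeIndex, and at most one matching map_df row per key prefix), but I'd like a cleaner way to express it:

import pandas as pd

keys = ['A', 'B', 'X', 'Y']          # priority order

result_df = main_df.copy()
result_df['Id2'] = pd.NA

# try the full key first, then progressively shorter prefixes
for i in range(len(keys), 0, -1):
    prefix = keys[:i]
    # map_df rows intended for this prefix: all lower-priority columns are NaN
    candidates = map_df[map_df[keys[i:]].isna().all(axis=1)]
    matched = result_df.merge(candidates[prefix + ['Id2']],
                              on=prefix, how='left', suffixes=('', '_new'))
    # fill Id2 only where no higher-priority prefix has matched yet
    result_df['Id2'] = result_df['Id2'].fillna(matched['Id2_new'])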

As a result of the mapping, the dataframe (result_df) should look like:

     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogpqr  0.2  uuid02  nx05
5  dog  dog2  dogabc  0.8  uuid01  nx05

Upvotes: 4

Views: 283

Answers (2)

Rob Raymond

Reputation: 31166

This borrows a concept from how an SQL engine can handle a complex query: generate a partial Cartesian product, then filter the results down to the required set.

In this case the filter is a calculated match weight.

import numpy as np

result_df = (main_df
 # minimum match - just A, effectively a partial Cartesian product
 .merge(map_df, on=["A"], suffixes=("", "_m"))
 # calculate weight of match: NaN is neutral, a mismatch penalises, a match rewards
 .assign(
     B_w=lambda dfa: np.where(dfa["B_m"].isna(), 0, np.where(dfa["B"]==dfa["B_m"], 1, -1)*1000),
     X_w=lambda dfa: np.where(dfa["X_m"].isna(), 0, np.where(dfa["X"]==dfa["X_m"], 1, -1)*100),
     Y_w=lambda dfa: np.where(dfa["Y_m"].isna(), 0, np.where(dfa["Y"]==dfa["Y_m"], 1, -1)*10),
     w=lambda dfa: dfa.loc[:, ["B_w", "X_w", "Y_w"]].sum(axis=1)
 )
 # biggest weight is the one we want
 .sort_values(["A", "B", "X", "Y", "w"], ascending=[True, True, True, True, False])
 .groupby(["A", "B", "X", "Y"]).first()
 # cleanup and don't show workings...
 .reset_index()
 .loc[:, ["A", "B", "X", "Y", "Id1", "Id2"]]
)

A second variant uses the longest available key instead of a weight. First the data setup, then the variant:

import io
import pandas as pd

main_df = pd.read_csv(io.StringIO("""     A     B       X    Y     Id1
0  cat  cat1  catabc  0.1  uuid01
1  cat  cat1  catxyz  0.4  uuid02
2  cat  cat2  catpqr  0.5  uuid01
3  dog  dog1  dogxyz  0.3  uuid03
4  dog  dog2  dogpqr  0.2  uuid02
5  dog  dog2  dogabc  0.8  uuid01"""), sep=r"\s+")
map_df = pd.read_csv(io.StringIO("""     A     B       X    Y   Id2
0  cat  cat1  catxyz  0.4  nx01
1  cat  cat1     NaN  NaN  nx02
2  cat   NaN     NaN  NaN  nx03
3  dog  dog1  dogxyz  NaN  nx04
4  dog   NaN     NaN  NaN  nx05"""), sep=r"\s+")

result_df = (main_df
 # minimum match - just A, effectively a partial Cartesian product
 .merge(map_df, on=["A"], suffixes=("", "_m"))
 # simpler to use an empty string for NaN
 .fillna("")
 # synthetic column made of the rest of the key, plus its length
 .assign(c=lambda dfa: dfa["B_m"] + dfa["X_m"] + dfa["Y_m"].astype(str),
         key_len=lambda dfa: dfa["c"].str.len())
 # longest synthetic key is the one we want
 .sort_values(["A", "B", "X", "Y", "key_len"], ascending=[True, True, True, True, False])
 .groupby(["A", "B", "X", "Y"]).first()
 # cleanup and don't show workings...
 .reset_index()
 .loc[:, ["A", "B", "X", "Y", "Id1", "Id2"]]
)

output (using the weighted method):

     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogabc  0.8  uuid01  nx05
5  dog  dog2  dogpqr  0.2  uuid02  nx05
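
Note that the sort and groupby return the rows ordered by the key columns rather than in main_df's original order. If the original order matters, one possible variation (a sketch, not part of the answer as posted; it assumes numpy/pandas and the dataframes set up above) is to carry main_df's index through the pipeline and group on it instead:

import numpy as np

# variation on the weighted approach above, grouping per original row
result_df = (main_df
 .reset_index()                                   # keep the original row number as a column
 .merge(map_df, on=["A"], suffixes=("", "_m"))
 .assign(
     B_w=lambda dfa: np.where(dfa["B_m"].isna(), 0, np.where(dfa["B"]==dfa["B_m"], 1, -1)*1000),
     X_w=lambda dfa: np.where(dfa["X_m"].isna(), 0, np.where(dfa["X"]==dfa["X_m"], 1, -1)*100),
     Y_w=lambda dfa: np.where(dfa["Y_m"].isna(), 0, np.where(dfa["Y"]==dfa["Y_m"], 1, -1)*10),
     w=lambda dfa: dfa.loc[:, ["B_w", "X_w", "Y_w"]].sum(axis=1)
 )
 # highest weight per original row wins
 .sort_values(["index", "w"], ascending=[True, False])
 .groupby("index").first()
 .loc[:, ["A", "B", "X", "Y", "Id1", "Id2"]]
)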

Upvotes: 1

Acorbe

Reputation: 8391

This solves it, though it's not very pretty.

It is an iterative algorithm that traverses your data first in a forward and then in a backward direction.

The forward pass resolves the merges with progressively lower priority, eliminating the matches from the data structure. It uses an inner merge strategy.

The backward pass updates the output from lowest to highest priority. The final update call is a bit slow, I noticed.

merge_on = ['A', 'B', 'X', 'Y']
tot_cols = len(merge_on)

operation_main_df = main_df.copy()
outputs = []

# forward pass on progressively smaller key sets
for first_idx in range(len(merge_on)):

    merged_id2 = operation_main_df.merge(map_df, how='inner',
                                         on=merge_on[0:tot_cols - first_idx],
                                         right_index=True, suffixes=("", "_y"))

    # the relevant output has the right number of NaN columns, the rest is garbage
    outputs.append(merged_id2[merged_id2.isna().sum(axis=1) == first_idx])


# backward updating pass
reverse_it = iter(outputs[::-1])

out = next(reverse_it)[['A','B','X','Y','Id1','Id2']].copy()
for el in reverse_it:
    out.update(el)


output:

     A     B       X    Y     Id1   Id2
0  cat  cat1  catabc  0.1  uuid01  nx02
1  cat  cat1  catxyz  0.4  uuid02  nx01
2  cat  cat2  catpqr  0.5  uuid01  nx03
3  dog  dog1  dogxyz  0.3  uuid03  nx04
4  dog  dog2  dogpqr  0.2  uuid02  nx05
5  dog  dog2  dogabc  0.8  uuid01  nx05

EDIT: the following change gives roughly a 10x speedup on my machine:

# change
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']]

# into
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']].copy()

Upvotes: 1
