Reputation: 511
I have a main dataframe (main_df) like:
A B X Y Id1
0 cat cat1 catabc 0.1 uuid01
1 cat cat1 catxyz 0.4 uuid02
2 cat cat2 catpqr 0.5 uuid01
3 dog dog1 dogxyz 0.3 uuid03
4 dog dog2 dogpqr 0.2 uuid02
5 dog dog2 dogabc 0.8 uuid01
And another dataframe (map_df) which will help with the mapping:
A B X Y Id2
0 cat cat1 catxyz 0.4 nx01
1 cat cat1 NaN NaN nx02
2 cat NaN NaN NaN nx03
3 dog dog1 dogxyz NaN nx04
4 dog NaN NaN NaN nx05
The priority of mapping for the columns will be in this order: ['A', 'B', 'X', 'Y'].
So, if all cells of a row of map_df match any row of main_df, then the corresponding element of Id2 should be added to main_df. If Y is not present, the mapping should proceed with ['A', 'B', 'X']; if X is also absent, it should proceed with ['A', 'B'], and so on.
I was able to achieve a partial result with a dataframe merge. For example:
main_df = main_df.merge(map_df, how='left', on=['A', 'B', 'X', 'Y'])
But I am unable to figure out the priority-based angle here.
As a result of the mapping, the dataframe (result_df) should look like:
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogpqr 0.2 uuid02 nx05
5 dog dog2 dogabc 0.8 uuid01 nx05
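In code terms, the behaviour I am after is roughly the loop below (assuming at most one map_df row per key combination at each level; it is not tuned for speed):

import pandas as pd

keys = ['A', 'B', 'X', 'Y']
result = main_df.copy()
result['Id2'] = pd.NA

for i in range(len(keys), 0, -1):                         # longest key first
    # map rows defined on exactly keys[:i]: those columns present, the rest NaN
    level = map_df[map_df[keys[:i]].notna().all(axis=1)
                   & map_df[keys[i:]].isna().all(axis=1)]
    hit = result.merge(level[keys[:i] + ['Id2']], how='left',
                       on=keys[:i], suffixes=('', '_lvl'))
    hit.index = result.index                              # left merge keeps row order
    result['Id2'] = result['Id2'].fillna(hit['Id2_lvl'])  # fill only rows still unmatched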
Upvotes: 4
Views: 283
Reputation: 31166
This uses concepts from how an SQL engine handles a complex query: generate a partial Cartesian product, then filter the results down to the required set.
In this case the filter is a calculated match weight:
result_df = (main_df
    # minimum match - just A, effectively a partial Cartesian product
    .merge(map_df, on=["A"], suffixes=("","_m"))
    # calculate weight of match: NaN is neutral, different penalises, same rewards
    .assign(
        B_w=lambda dfa: np.where(dfa["B_m"].isna(), 0, np.where(dfa["B"]==dfa["B_m"],1,-1)*1000),
        X_w=lambda dfa: np.where(dfa["X_m"].isna(), 0, np.where(dfa["X"]==dfa["X_m"],1,-1)*100),
        Y_w=lambda dfa: np.where(dfa["Y_m"].isna(), 0, np.where(dfa["Y"]==dfa["Y_m"],1,-1)*10),
        w=lambda dfa: dfa.loc[:,["B_w","X_w","Y_w"]].sum(axis=1)
    )
    # the biggest weight is the one we want
    .sort_values(["A","B","X","Y","w"], ascending=[True,True,True,True,False])
    .groupby(["A","B","X","Y"]).first()
    # cleanup and don't show workings...
    .reset_index()
    .loc[:,["A","B","X","Y","Id1","Id2"]]
)
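The powers-of-ten weights make the priority strict: a reward or penalty on B (±1000) always outweighs anything X and Y can contribute together (at most ±110), and X (±100) likewise outweighs Y (±10), so the row with the highest w is the best match in the required ['A', 'B', 'X', 'Y'] priority order.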
An alternative, shown below, is to pick the longest available key instead of a weight. Here is a fully reproducible version (data setup, then the longest-key variant):
import io
import numpy as np
import pandas as pd

main_df = pd.read_csv(io.StringIO(""" A B X Y Id1
0 cat cat1 catabc 0.1 uuid01
1 cat cat1 catxyz 0.4 uuid02
2 cat cat2 catpqr 0.5 uuid01
3 dog dog1 dogxyz 0.3 uuid03
4 dog dog2 dogpqr 0.2 uuid02
5 dog dog2 dogabc 0.8 uuid01"""), sep=r"\s+")
map_df = pd.read_csv(io.StringIO(""" A B X Y Id2
0 cat cat1 catxyz 0.4 nx01
1 cat cat1 NaN NaN nx02
2 cat NaN NaN NaN nx03
3 dog dog1 dogxyz NaN nx04
4 dog NaN NaN NaN nx05"""), sep=r"\s+")
result_df = (main_df
    # minimum match - just A, effectively a partial Cartesian product
    .merge(map_df, on=["A"], suffixes=("","_m"))
    # simpler to use an empty string for NaN
    .fillna("")
    # keep only candidates whose non-empty key parts actually match the row
    .loc[lambda dfa: ((dfa["B_m"]=="") | (dfa["B"]==dfa["B_m"]))
                   & ((dfa["X_m"]=="") | (dfa["X"]==dfa["X_m"]))
                   & ((dfa["Y_m"]=="") | (dfa["Y"]==dfa["Y_m"]))]
    # synthetic column made of the rest of the key, and its length
    .assign(c=lambda dfa: dfa["B_m"]+dfa["X_m"]+dfa["Y_m"].astype(str),
            l=lambda dfa: dfa["c"].str.len())
    # the longest synthetic key is the one we want
    .sort_values(["A","B","X","Y","l"], ascending=[True,True,True,True,False])
    .groupby(["A","B","X","Y"]).first()
    # cleanup and don't show workings...
    .reset_index()
    .loc[:,["A","B","X","Y","Id1","Id2"]]
)
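Output: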
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogabc 0.8 uuid01 nx05
5 dog dog2 dogpqr 0.2 uuid02 nx05
Upvotes: 1
Reputation: 8391
This would solve it, though it is not very pretty.
It is an iterative algorithm that traverses your data first in a forward and then in a backward direction.
The forward pass resolves the merges with progressively lower priority, eliminating the matches from the data structure. It uses an inner merge strategy.
The backward pass updates the output from lowest to highest priority.
The final update call is a bit slow, I noticed.
merge_on = ['A', 'B', 'X', 'Y']
tot_cols = len(merge_on)
operation_main_df = main_df.copy()
outputs = []
# forward pass on progressively smaller key sets
for first_idx in range(len(merge_on)):
    merged_id2 = operation_main_df.merge(map_df, how='inner',
                                         on=merge_on[0:tot_cols - first_idx],
                                         right_index=True, suffixes=("", "_y"))
    # the relevant output has the right number of NaN columns, the rest is garbage
    outputs.append(merged_id2[merged_id2.isna().sum(axis=1) == first_idx])
# backward updating pass, from lowest to highest priority
reverse_it = iter(outputs[::-1])
out = next(reverse_it)[['A', 'B', 'X', 'Y', 'Id1', 'Id2']].copy()
for el in reverse_it:
    # update() aligns on the index, so higher-priority matches overwrite lower-priority ones
    out.update(el)
Output:
A B X Y Id1 Id2
0 cat cat1 catabc 0.1 uuid01 nx02
1 cat cat1 catxyz 0.4 uuid02 nx01
2 cat cat2 catpqr 0.5 uuid01 nx03
3 dog dog1 dogxyz 0.3 uuid03 nx04
4 dog dog2 dogpqr 0.2 uuid02 nx05
5 dog dog2 dogabc 0.8 uuid01 nx05
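For reference, the backward pass works because DataFrame.update aligns on the index and overwrites matching cells in place, so a higher-priority output overwrites exactly the rows it matched. A minimal sketch with made-up frames:

import pandas as pd

# lower-priority result: every row got some Id2
out = pd.DataFrame({"Id2": ["nx05", "nx05"]}, index=[3, 4])
# higher-priority result: only row 3 was matched
better = pd.DataFrame({"Id2": ["nx04"]}, index=[3])

out.update(better)   # aligned on the index, only row 3 is overwritten
print(out)           # row 3 -> nx04, row 4 stays nx05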
EDIT: this accelerates it by roughly a 10x factor on my machine:
# change
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']]
# into
out = next(reverse_it)[['A','B','X','Y','Id1','Id2']].copy()
Upvotes: 1