Reputation: 12273
I have two dataframes, say A
and B
, that have some columns named attr1
, attr2
, attrN
.
I have a certain distance measure, and I would like to merge the dataframes, such that each row in A
is merged with the row in B
that has the shortest distance between attributes. Note that rows in B
can be repeated when merging.
For example (with one attribute to keep things simple), merging these two tables using absolute difference distance |A.attr1 - B.att1|
A | attr1 B | attr1
0 | 10 0 | 15
1 | 20 1 | 27
2 | 30 2 | 80
should yield the following merged table
M | attr1_A attr1_B
0 | 10 15
1 | 20 15
2 | 30 27
My current way of doing this is slow and is based on comparing each row of A
with each row of B
, but code is also not clear because I have to preserve indices for merging and I am not satisfied at all, but I cannot come up with a better solution.
How can I perform the merge as above using pandas? Are there any convenience methods or functions that can be helpful here?
EDIT: Just to clarify, in the dataframes there are also other columns which are not used in the distance calculation, but have to be merged as well.
Upvotes: 5
Views: 3618
Reputation: 153510
One way you could do it as follows:
A = pd.DataFrame({'attr1':[10,20,30]})
B = pd.DataFrame({'attr1':[15,15,27]})
Use a cross join to get all combinations
Update for 1.2+ pandas use how='cross'
merge_AB = A.merge(B, how='cross', suffixes = ('_A', '_B'))
Older pandas version use psuedo key...
A = A.assign(key=1)
B = B.assign(key=1)
merged_AB =pd.merge(A,B, on='key',suffixes=('_A','_B'))
Now let's find the min distances in merged_AB
M = merged_AB.groupby('attr1_A').apply(lambda x:abs(x['attr1_A']-x['attr1_B'])==abs(x['attr1_A']-x['attr1_B']).min())
merged_AB[M.values].drop_duplicates().drop('key',axis=1)
Output:
attr1_A attr1_B
0 10 15
3 20 15
8 30 27
Upvotes: 11