Reputation: 1816
I have a dataframe with two sorted columns of integers.
A B
0 17 15
1 18 18
2 19 20
3 20 21
4 22 21
5 23 27
For all elements of B, I want to find the index of the closest matching element of A:
A B closest_match_idx
0 17 15 0
1 18 18 1
2 19 20 3
3 20 21 3
4 22 21 3
5 23 27 5
I know I can do
df['closest_match_idx'] = df.B.map(lambda x: (df.A - x).abs().idxmin()))
but that's an O(N**2) solution for a problem that is clearly O(N). I couldn't find any better solution other than rolling my own index-finding function, but this feels like a problem that has an existing solution. Ideas?
For context, what I'm eventually trying to do is find the closest matching element in A for each element of B, up to a maximum absolute difference (otherwise just use the value from B):
match_diff = df.B - df.A.iloc[df['closest_match_idx']]
df['output'] = B
replace_idxs = np.where(diff.abs() <= 2)
df['output'].iloc[replace_idxs] = df['A'].iloc[replace_idxs]
searchsorted or this index trick is almost there, but not quite.
A B closest_match_idx match_diff output
0 17 15 0 2 17
1 18 18 1 0 18
2 19 20 3 1 20
3 20 21 3 1 20
4 22 21 3 1 20
5 23 27 5 4 23
I should also note that in my example A and B have the same length, but I'd like a solution that generalizes to series of different lengths.
Upvotes: 2
Views: 64
Reputation: 59549
You can use merge_asof
. This requires sorting the frames. This has the benefit that it supports a tolerance
argument allowing you to specify a caliper within which you consider matches.
I'll leave in the additional 'A_match'
column, but you can drop it if you don't need it.
res = pd.merge_asof(df.sort_values('B'),
df.rename_axis(index='closest_idx').reset_index().drop(columns='B').sort_values('A'),
left_on='B', right_on='A',
direction='nearest',
suffixes=['', '_match'])
print(res)
A B closest_idx A_match
0 17 15 0 17
1 18 18 1 18
2 19 20 3 20
3 20 21 3 20
4 22 21 3 20
5 23 27 5 23
Setting a tolerance of |distance| <= 1
res = pd.merge_asof(df.sort_values('B'),
df.rename_axis(index='closest_idx').reset_index().drop(columns='B').sort_values('A'),
left_on='B', right_on='A',
direction='nearest',
suffixes=['', '_match'],
tolerance=1)
# A B closest_idx A_match
#0 17 15 NaN NaN
#1 18 18 1.0 18.0
#2 19 20 3.0 20.0
#3 20 21 3.0 20.0
#4 22 21 3.0 20.0
#5 23 27 NaN NaN
Upvotes: 4