cydonian
cydonian

Reputation: 1816

Pandas: For all elements of sorted series B, find index of closest element in sorted series A

I have a dataframe with two sorted columns of integers.

      A        B
0     17       15
1     18       18
2     19       20
3     20       21
4     22       21
5     23       27

For all elements of B, I want to find the index of the closest matching element of A:

      A        B       closest_match_idx
0     17       15      0
1     18       18      1
2     19       20      3
3     20       21      3
4     22       21      3
5     23       27      5

I know I can do

df['closest_match_idx'] = df.B.map(lambda x: (df.A - x).abs().idxmin()))

but that's an O(N**2) solution for a problem that is clearly O(N). I couldn't find any better solution other than rolling my own index-finding function, but this feels like a problem that has an existing solution. Ideas?

For context, what I'm eventually trying to do is find the closest matching element in A for each element of B, up to a maximum absolute difference (otherwise just use the value from B):

match_diff = df.B - df.A.iloc[df['closest_match_idx']]
df['output'] = B
replace_idxs = np.where(diff.abs() <= 2)
df['output'].iloc[replace_idxs] = df['A'].iloc[replace_idxs]

searchsorted or this index trick is almost there, but not quite.

      A        B       closest_match_idx   match_diff    output
0     17       15      0                   2             17
1     18       18      1                   0             18
2     19       20      3                   1             20
3     20       21      3                   1             20
4     22       21      3                   1             20
5     23       27      5                   4             23

I should also note that in my example A and B have the same length, but I'd like a solution that generalizes to series of different lengths.

Upvotes: 2

Views: 64

Answers (1)

ALollz
ALollz

Reputation: 59549

You can use merge_asof. This requires sorting the frames. This has the benefit that it supports a tolerance argument allowing you to specify a caliper within which you consider matches.

I'll leave in the additional 'A_match' column, but you can drop it if you don't need it.

res = pd.merge_asof(df.sort_values('B'), 
                    df.rename_axis(index='closest_idx').reset_index().drop(columns='B').sort_values('A'),
                    left_on='B', right_on='A',
                    direction='nearest',
                    suffixes=['', '_match'])

print(res)

    A   B  closest_idx  A_match
0  17  15            0       17
1  18  18            1       18
2  19  20            3       20
3  20  21            3       20
4  22  21            3       20
5  23  27            5       23

Setting a tolerance of |distance| <= 1

res = pd.merge_asof(df.sort_values('B'), 
                    df.rename_axis(index='closest_idx').reset_index().drop(columns='B').sort_values('A'),
                    left_on='B', right_on='A',
                    direction='nearest',
                    suffixes=['', '_match'],
                    tolerance=1)

#    A   B  closest_idx  A_match
#0  17  15          NaN      NaN
#1  18  18          1.0     18.0
#2  19  20          3.0     20.0
#3  20  21          3.0     20.0
#4  22  21          3.0     20.0
#5  23  27          NaN      NaN

Upvotes: 4

Related Questions