yungpadewon
yungpadewon

Reputation: 389

Finding closest value while grouping by a column

I want to create 2 new columns which would give me the closest value and ID to a certain value. This is how my dataframe is structured:

x_time    expiration    x_price    p_time    p_price
 100          4          55.321     100        21
 105          4          51.120     105        25
 110          4          44.412     110        33.1
 100          5           9.1       100        3.1
 105          5           9.5       105        5.1
 110          5           8.2       110        12.1 
 100          6           122.1     100        155.9
 105          6           144.1     105        134.2 
 .......

Essentially, I want to create a new column (called closest_time & closest_price) which would be the closest p_price to the x_price for that group only (hence the group by expiration)

So, the expected results would look like this:

x_time    expiration    x_price    p_time    p_price   closest_price closest_p_time
 100          4          55.321     100        21           33.1       110
 105          4          51.120     105        25           33.1       110
 110          4          44.412     110        33.1         33.1       110
 100          5           9.1       100        3.1          12.1       110
 105          5           9.5       105        5.1          12.1       110
 110          5           8.2       110        12.1          5.1       105
 100          6           122.1     100        155.9       134.2       105
 105          6           144.1     105        134.2       134.22      100

I have thought of a potential way to go about doing this: by looping over the expirations and then parsing through p_price and comparing all values to each x_price and selecting the closest one (min(abs())).

How can I accomplish this with a vectorized dataframe operation?

Upvotes: 5

Views: 3064

Answers (3)

Ken Syme
Ken Syme

Reputation: 3632

I'm not entirely sure if your expected results is right - the last row seems off, 144.1 is closer to 134.2 than 155.9, so the closest_p_time should be 105?

I do this by first building a lookup of what row contains the closest value, then just selecting the price and time for those rows.

import io
    
d = """
x_time    expiration    x_price    p_time    p_price
 100          4          55.321     100        21
 105          4          51.120     105        25
 110          4          44.412     110        33.1
 100          5           9.1       100        3.1
 105          5           9.5       105        5.1
 110          5           8.2       110        12.1 
 100          6           122.1     100        155.9
 105          6           144.1     105        134.2 
"""
    
df = pd.read_csv(io.StringIO(d), delim_whitespace=True)

idx_lookup = df.apply(lambda x: (df.loc[df['expiration'] == x['expiration'], 
'p_price'] - x['x_price']).abs().idxmin(), axis=1)
    
df['closest_p_price'] = df.loc[idx_lookup, 'p_price'].values
df['closest_p_time'] = df.loc[idx_lookup, 'p_time'].values

Which gives the dataframe

   x_time  expiration  x_price  p_time  p_price  closest_p_price  closest_p_time
0     100           4   55.321     100     21.0             33.1             110
1     105           4   51.120     105     25.0             33.1             110
2     110           4   44.412     110     33.1             33.1             110
3     100           5    9.100     100      3.1             12.1             110
4     105           5    9.500     105      5.1             12.1             110
5     110           5    8.200     110     12.1              5.1             105
6     100           6  122.100     100    155.9            134.2             105
7     105           6  144.100     105    134.2            134.2             105

Note this uses apply along axis=1, which essentially goes through each row. I have not yet thought of a way to do this any better.

Upvotes: 2

Umar.H
Umar.H

Reputation: 23099

Okay, my attempt. I'm using a self-defined function which I've taken from this post which turns a column into an array based of a value. I then use groupby and transform to run over your intended arrays to find your closest match.

def find_nearest(array, values):
array = np.asarray(array)

    # the last dim must be 1 to broadcast in (array - values) below.
    values = np.expand_dims(values, axis=-1) 

    indices = np.abs(array - values).argmin(axis=-1)
    return array[indices]

    ## lets use a lambda with transform to assign this back to our main df 
    df['closest_price'] = df.groupby('expiration')['x_price'].transform(lambda x : find_nearest(df['p_price'],x))
    df['closest_time'] = df.groupby('expiration')['x_time'].transform(lambda x : find_nearest(df['p_time'],x))


     print(df)
    x_time  expiration  x_price p_time  p_price closest_price   closest_time
0   100     4   55.321      100 21.0    33.1    100
1   105     4   51.120      105 25.0    33.1    105
2   110     4   44.412      110 33.1    33.1    110
3   100     5   9.100       100 3.1     12.1    100
4   105     5   9.500       105 5.1     12.1    105
5   110     5   8.200       110 12.1    5.1     110
6   100     6   122.100     100 155.9   134.2   100
7   105     6   144.100     105 134.2   134.2   105

Upvotes: 1

tvgriek
tvgriek

Reputation: 1265

I think a nice solution is this:

df['closest_price'] = \
    df.apply(lambda x: df[df.p_price <= x.x_price]['p_price'].max(), axis=1)

Upvotes: 4

Related Questions