Reputation: 389
I want to create 2 new columns which would give me the closest value and ID to a certain value. This is how my dataframe is structured:
x_time expiration x_price p_time p_price
100 4 55.321 100 21
105 4 51.120 105 25
110 4 44.412 110 33.1
100 5 9.1 100 3.1
105 5 9.5 105 5.1
110 5 8.2 110 12.1
100 6 122.1 100 155.9
105 6 144.1 105 134.2
.......
Essentially, I want to create two new columns (called closest_time and closest_price) which would hold the p_price closest to the x_price for that group only (hence the group by expiration).
So, the expected results would look like this:
x_time expiration x_price p_time p_price closest_price closest_p_time
100 4 55.321 100 21 33.1 110
105 4 51.120 105 25 33.1 110
110 4 44.412 110 33.1 33.1 110
100 5 9.1 100 3.1 12.1 110
105 5 9.5 105 5.1 12.1 110
110 5 8.2 110 12.1 5.1 105
100 6 122.1 100 155.9 134.2 105
105 6 144.1 105 134.2 134.22 100
I have thought of a potential way to go about this: loop over the expirations, parse through p_price, compare all of its values to each x_price, and select the closest one (min(abs())).
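That looping idea can be sketched roughly as follows (frame reconstructed from the sample above; column names taken from the expected output):

```python
import io
import pandas as pd

d = """
x_time expiration x_price p_time p_price
100 4 55.321 100 21
105 4 51.120 105 25
110 4 44.412 110 33.1
100 5 9.1 100 3.1
105 5 9.5 105 5.1
110 5 8.2 110 12.1
100 6 122.1 100 155.9
105 6 144.1 105 134.2
"""
df = pd.read_csv(io.StringIO(d), sep=r"\s+")

# loop over expirations; for each row, keep the p_price (and its p_time)
# with the smallest absolute distance to that row's x_price
price, time = {}, {}
for _, group in df.groupby("expiration"):
    for i in group.index:
        j = (group["p_price"] - group.at[i, "x_price"]).abs().idxmin()
        price[i] = group.at[j, "p_price"]
        time[i] = group.at[j, "p_time"]

df["closest_price"] = pd.Series(price)
df["closest_p_time"] = pd.Series(time)
```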
How can I accomplish this with a vectorized dataframe operation?
Upvotes: 5
Views: 3064
Reputation: 3632
I'm not entirely sure your expected results are right - the last row seems off: 144.1 is closer to 134.2 than to 155.9, so the closest_p_time should be 105?
I'd do this by first building a lookup of which row contains the closest value, then simply selecting the price and time from those rows.
import io
import pandas as pd
d = """
x_time expiration x_price p_time p_price
100 4 55.321 100 21
105 4 51.120 105 25
110 4 44.412 110 33.1
100 5 9.1 100 3.1
105 5 9.5 105 5.1
110 5 8.2 110 12.1
100 6 122.1 100 155.9
105 6 144.1 105 134.2
"""
df = pd.read_csv(io.StringIO(d), delim_whitespace=True)
idx_lookup = df.apply(lambda x: (df.loc[df['expiration'] == x['expiration'],
'p_price'] - x['x_price']).abs().idxmin(), axis=1)
df['closest_p_price'] = df.loc[idx_lookup, 'p_price'].values
df['closest_p_time'] = df.loc[idx_lookup, 'p_time'].values
Which gives the dataframe
x_time expiration x_price p_time p_price closest_p_price closest_p_time
0 100 4 55.321 100 21.0 33.1 110
1 105 4 51.120 105 25.0 33.1 110
2 110 4 44.412 110 33.1 33.1 110
3 100 5 9.100 100 3.1 12.1 110
4 105 5 9.500 105 5.1 12.1 110
5 110 5 8.200 110 12.1 5.1 105
6 100 6 122.100 100 155.9 134.2 105
7 105 6 144.100 105 134.2 134.2 105
Note this uses apply along axis=1, which essentially goes through each row. I have not yet thought of a better way to do this.
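For what it's worth, the per-row apply can be replaced by broadcasting inside each group; a rough sketch on the same sample data (the closest_per_group helper is my own naming, not part of the code above):

```python
import io
import numpy as np
import pandas as pd

d = """
x_time expiration x_price p_time p_price
100 4 55.321 100 21
105 4 51.120 105 25
110 4 44.412 110 33.1
100 5 9.1 100 3.1
105 5 9.5 105 5.1
110 5 8.2 110 12.1
100 6 122.1 100 155.9
105 6 144.1 105 134.2
"""
df = pd.read_csv(io.StringIO(d), sep=r"\s+")

def closest_per_group(g):
    # |p_price - x_price| for every pair of rows in the group, via broadcasting
    diff = np.abs(g["p_price"].to_numpy()[None, :] - g["x_price"].to_numpy()[:, None])
    idx = diff.argmin(axis=1)  # position of the nearest p_price for each row
    return pd.DataFrame(
        {"closest_p_price": g["p_price"].to_numpy()[idx],
         "closest_p_time": g["p_time"].to_numpy()[idx]},
        index=g.index,
    )

res = df.groupby("expiration", group_keys=False).apply(closest_per_group)
df["closest_p_price"] = res["closest_p_price"]
df["closest_p_time"] = res["closest_p_time"]
```

This trades the row-by-row lookup for one O(n²) distance matrix per group, which should be faster as long as the groups stay small.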
Upvotes: 2
Reputation: 23099
Okay, my attempt. I'm using a self-defined function, which I've taken from this post, that finds the nearest values in an array. I then use groupby and transform to run over your intended arrays and find the closest match.
import numpy as np

def find_nearest(array, values):
    array = np.asarray(array)
    # the last dim must be 1 to broadcast in (array - values) below
    values = np.expand_dims(values, axis=-1)
    indices = np.abs(array - values).argmin(axis=-1)
    return array[indices]
# let's use a lambda with transform to assign this back to our main df
df['closest_price'] = df.groupby('expiration')['x_price'].transform(lambda x : find_nearest(df['p_price'],x))
df['closest_time'] = df.groupby('expiration')['x_time'].transform(lambda x : find_nearest(df['p_time'],x))
print(df)
x_time expiration x_price p_time p_price closest_price closest_time
0 100 4 55.321 100 21.0 33.1 100
1 105 4 51.120 105 25.0 33.1 105
2 110 4 44.412 110 33.1 33.1 110
3 100 5 9.100 100 3.1 12.1 100
4 105 5 9.500 105 5.1 12.1 105
5 110 5 8.200 110 12.1 5.1 110
6 100 6 122.100 100 155.9 134.2 100
7 105 6 144.100 105 134.2 134.2 105
Upvotes: 1
Reputation: 1265
I think a nice solution is this:
df['closest_price'] = \
df.apply(lambda x: df[df.p_price <= x.x_price]['p_price'].max(), axis=1)
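A quick check of what this returns on the sample data (frame reconstructed from the question). Note that the filter df.p_price <= x.x_price runs over the whole frame, so this picks the largest p_price not exceeding each x_price, rather than the nearest p_price within the expiration group:

```python
import io
import pandas as pd

d = """
x_time expiration x_price p_time p_price
100 4 55.321 100 21
105 4 51.120 105 25
110 4 44.412 110 33.1
100 5 9.1 100 3.1
105 5 9.5 105 5.1
110 5 8.2 110 12.1
100 6 122.1 100 155.9
105 6 144.1 105 134.2
"""
df = pd.read_csv(io.StringIO(d), sep=r"\s+")

# for each row: largest p_price anywhere in the frame that is <= x_price
df["closest_price"] = \
    df.apply(lambda x: df[df.p_price <= x.x_price]["p_price"].max(), axis=1)
```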
Upvotes: 4