student010101
student010101

Reputation: 197

Computing column in dataframe based on nearest value in another dataframe

I have two dataframes coarse and fine. fine contains 3 columns start_time, end_time, start_price, end_price. coarse contains start_time, end_time. Note that all the times are pandas timestamp objects, e.g., 2016-12-12 01:03:13.15231+00:00

For coarse I need to add 2 additional columns start_price, end_price, where coarse.start_price is the the fine.start_price that has a fine.start_time closest to coarse.start_time, and analogously so for end_price.

Here's an example for clarity:

coarse:

start_time                              end_time
2016-12-12 01:00:00.000+00:00           2016-12-12 02:00:00.000+00:00
2016-12-12 02:00:00.000+00:00           2016-12-12 03:00:00.000+00:00
2016-12-12 03:00:00.000+00:00           2016-12-12 03:30:00.000+00:00
fine:

start_time                              end_time                         start_price 
2016-12-12 00:59:00.000+00:00           2016-12-12 01:12:00.000+00:00    2.3
2016-12-12 01:12:00.000+00:00           2016-12-12 01:15:00.000+00:00    4.5
2016-12-12 01:15:00.000+00:00           2016-12-12 01:45:00.000+00:00    5.7
2016-12-12 01:45:00.000+00:00           2016-12-12 01:55:00.000+00:00    8.8
2016-12-12 01:55:00.000+00:00           2016-12-12 02:15:00.000+00:00    9.9
2016-12-12 02:15:00.000+00:00           2016-12-12 02:16:00.000+00:00    11.2
2016-12-12 02:16:00.000+00:00           2016-12-12 02:31:00.000+00:00    13.5
2016-12-12 02:31:00.000+00:00           2016-12-12 02:45:00.000+00:00    14.8
2016-12-12 02:45:00.000+00:00           2016-12-12 02:59:00.000+00:00    15.9
2016-12-12 02:59:00.000+00:00           2016-12-12 03:31:00.000+00:00    16.0

So the result should be

coarse:
 
start_time                              end_time                         start_price
2016-12-12 01:00:00.000+00:00           2016-12-12 02:00:00.000+00:00    2.3
2016-12-12 02:00:00.000+00:00           2016-12-12 03:00:00.000+00:00    9.9
2016-12-12 03:00:00.000+00:00           2016-12-12 03:30:00.000+00:00    16.0

(I skipped end_price because it's the same logic). I am wondering what is the easiest way to do this?

Additionally, rather than using the nearest neighbor, I may consider interpolating the price instead based on the times. Is there an easy way to do this as well?

Upvotes: 0

Views: 47

Answers (1)

Chris
Chris

Reputation: 16172

You may be looking for pd.merge_asof which is perfect for joining on times that don't match exactly. There is a direction parameter which you can adjust to cause the match to go the direction you want.

import pandas as pd

coarse = pd.DataFrame({'start_time': ['2016-12-12 01:00:00.000+00:00', '2016-12-12 02:00:00.000+00:00', '2016-12-12 03:00:00.000+00:00'], 'end_time': ['2016-12-12 02:00:00.000+00:00', '2016-12-12 03:00:00.000+00:00', '2016-12-12 03:30:00.000+00:00']} )

fine = pd.DataFrame({'start_time': ['2016-12-12 00:59:00.000+00:00', '2016-12-12 01:12:00.000+00:00', '2016-12-12 01:15:00.000+00:00', '2016-12-12 01:45:00.000+00:00', '2016-12-12 01:55:00.000+00:00', '2016-12-12 02:15:00.000+00:00', '2016-12-12 02:16:00.000+00:00', '2016-12-12 02:31:00.000+00:00', '2016-12-12 02:45:00.000+00:00', '2016-12-12 02:59:00.000+00:00'], 'end_time': ['2016-12-12 01:12:00.000+00:00', '2016-12-12 01:15:00.000+00:00', '2016-12-12 01:45:00.000+00:00', '2016-12-12 01:55:00.000+00:00', '2016-12-12 02:15:00.000+00:00', '2016-12-12 02:16:00.000+00:00', '2016-12-12 02:31:00.000+00:00', '2016-12-12 02:45:00.000+00:00', '2016-12-12 02:59:00.000+00:00', '2016-12-12 03:31:00.000+00:00'], 'start_price': [2.3, 4.5, 5.7, 8.8, 9.9, 11.2, 13.5, 14.8, 15.9, 16.0]})

coarse['start_time'] = pd.to_datetime(coarse['start_time'])
fine['start_time'] = pd.to_datetime(fine['start_time'])
coarse['end_time'] = pd.to_datetime(coarse['end_time'])
fine['end_time'] = pd.to_datetime(fine['end_time'])


coarse = pd.merge_asof(coarse, fine[['start_time','start_price']],
                       on='start_time',
                       direction='nearest')

coarse = pd.merge_asof(coarse,
                       fine[['end_time','start_price']].rename(columns={'start_price':'end_price'}),
                       on='end_time',
                       direction='nearest')

Output

                 start_time                  end_time  start_price  end_price
0 2016-12-12 01:00:00+00:00 2016-12-12 02:00:00+00:00          2.3        8.8
1 2016-12-12 02:00:00+00:00 2016-12-12 03:00:00+00:00          9.9       15.9
2 2016-12-12 03:00:00+00:00 2016-12-12 03:30:00+00:00         16.0       16.0

Upvotes: 1

Related Questions