Reputation: 17
I have two dataframes
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now in first dataframe I create column
test1['values'] = np.zeros(10)
I want to fill this column, next to each date there should be the index of the closest date from second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course my real data is not evenly spaced and has minutes and seconds, but the idea is same. I use the following code:
def nearest(items, pivot):
return min(items, key=lambda x: abs(x - pivot))
for k in range(10):
a = nearest(test2['time'], test1['time'][k]) ### find nearest timestamp from second dataframe
b = test2.index[test2['time'] == a].tolist()[0] ### identify the index of this timestamp
test1['value'][k] = b ### assign this value to the cell
This code is very slow on large datasets, how can I make it more efficient?
P.S. timestamps in my real data are sorted and increasing just like in these artificial examples.
Upvotes: 1
Views: 493
Reputation: 25564
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days
if offset < 0: # test1 time starts before test2 time, prepend zeros:
offset = abs(offset)
idx = np.append(np.zeros(offset), np.arange(len(test1['time'])-offset)).astype(int)
else: # test1 time starts after or with test2 time, use arange right away:
idx = np.arange(offset, offset+len(test1['time']))
test1['values'] = idx
Upvotes: 1