Reputation: 629
I have a big dataframe containing around 7,000,000 rows of time series data that looks like this:
timestamp | values
2019-08-01 14:53:01 | 20.0
2019-08-01 14:53:55 | 29.0
2019-08-01 14:53:58 | 22.4
...
2019-08-02 14:53:25 | 27.9
I want to create a column that is a 1-day lagged version of the values for each row. Since my timestamps don't line up exactly from day to day, I can't use the normal shift() method.
The result will be something like this:
timestamp | values | lag
2019-08-01 14:53:01 | 20.0 | NaN
2019-08-01 14:53:55 | 29.0 | NaN
2019-08-01 14:53:58 | 22.4 | NaN
...
2019-08-02 14:53:25 | 27.9 | 20.0
I found a post about getting the row closest to a given time (Find closest row of DataFrame to given time in Pandas) and tried the methods there. They do the job, but take far too long to run. Here's what I have:
def get_nearest(data, timestamp):
    # position of the row whose index is closest to the given timestamp
    index = data.index.get_loc(timestamp, method="nearest")
    return data.iloc[index, 0]

# look up the value nearest to one day before each row's timestamp
df['lag'] = [get_nearest(df, dt - pd.Timedelta(days=1)) for dt in df.index]
Is there a more efficient way to solve this problem?
Upvotes: 6
Views: 7304
Reputation: 798
Hmmmm, not sure if this will work out to be more efficient, but merge_asof is an approach worth looking at, as it won't require a UDF.
# calendar date for exact matching, timestamp for nearest matching
df['date'] = df.timestamp.dt.date

# shift a copy forward by one day so yesterday's rows line up with today
df2 = df.copy()
df2['date'] = df2['date'] + pd.to_timedelta(1, unit='D')
df2['timestamp'] = df2['timestamp'] + pd.to_timedelta(1, unit='D')

pd.merge_asof(df, df2, on='timestamp', by='date', direction='nearest')
The approach essentially merges the previous day's values onto the next day and then matches each row to the nearest timestamp.
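Not from the answer itself, but for illustration, here is a minimal self-contained sketch of the same idea on the question's sample data. It assumes timestamp is a regular column (call reset_index() first if it's the index), sorts both frames by the on key as merge_asof requires, and uses a suffixes argument only to give the lagged column a readable name:

import pandas as pd

# toy frame mirroring the question's sample data
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-08-01 14:53:01', '2019-08-01 14:53:55',
                                 '2019-08-01 14:53:58', '2019-08-02 14:53:25']),
    'values': [20.0, 29.0, 22.4, 27.9],
})

df['date'] = df.timestamp.dt.date
df2 = df.copy()
df2['date'] = df2['date'] + pd.to_timedelta(1, unit='D')
df2['timestamp'] = df2['timestamp'] + pd.to_timedelta(1, unit='D')

# both frames must be sorted by the 'on' key for merge_asof
merged = pd.merge_asof(df.sort_values('timestamp'), df2.sort_values('timestamp'),
                       on='timestamp', by='date', direction='nearest',
                       suffixes=('', '_lag'))
print(merged[['timestamp', 'values', 'values_lag']])

The 2019-08-01 rows have no shifted previous day to match against, so values_lag is NaN there, while the 2019-08-02 row picks up 20.0, as in the desired output.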
Upvotes: 2
Reputation: 86310
Assuming your dates are sorted, one way to do this quickly would be to use pd.DatetimeIndex.searchsorted to find all the matching dates in O(N log N) time.
Creating some test data, it might look something like this:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(
    {'values': np.random.rand(10)},
    index=sorted(np.random.choice(pd.date_range('2019-08-01', freq='T', periods=10000), 10, replace=False))
)
def add_lag(df):
    # position of the first timestamp at or after (t - 1 day) for each row
    ind = df.index.searchsorted(df.index - pd.DateOffset(1))
    # rows whose lagged time falls outside the observed range get no lag
    out_of_range = (ind <= 0) | (ind >= df.shape[0])
    ind[out_of_range] = 0
    lag = df['values'].values[ind]
    lag[out_of_range] = np.nan
    df['lag'] = lag
    return df
add_lag(df)
values lag
2019-08-01 06:17:00 0.548814 NaN
2019-08-01 10:51:00 0.715189 NaN
2019-08-01 13:56:00 0.602763 NaN
2019-08-02 09:50:00 0.544883 0.715189
2019-08-03 14:06:00 0.423655 0.423655
2019-08-04 03:00:00 0.645894 0.423655
2019-08-05 07:40:00 0.437587 0.437587
2019-08-07 00:41:00 0.891773 0.891773
2019-08-07 07:05:00 0.963663 0.891773
2019-08-07 15:55:00 0.383442 0.891773
With this approach, the lag column for a dataframe with 1 million rows can be computed in tens of milliseconds:
df = pd.DataFrame(
    {'values': np.random.rand(1000000)},
    index=sorted(np.random.choice(pd.date_range('2019-08-01', freq='T', periods=10000000), 1000000, replace=False))
)
%timeit add_lag(df)
# 10 loops, best of 3: 71.5 ms per loop
Note however that this doesn't find the nearest value to a lag of one day, but the nearest value after a lag of one day. If you want the nearest value in either direction, you'll need to modify this approach.
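Not part of the answer above, but one possible sketch of that modification: for each row, take both the first timestamp at or after (t - 1 day) and the one just before it, and keep whichever is closer. The column name lag_nearest is just illustrative, and df is the frame built above:

import numpy as np
import pandas as pd

def add_lag_nearest(df):
    ts = df.index.values
    target = (df.index - pd.DateOffset(1)).values
    n = len(ts)

    # "after" candidate: first timestamp >= target; "before" candidate: the one preceding it
    right = np.clip(ts.searchsorted(target), 0, n - 1)
    left = np.clip(right - 1, 0, n - 1)

    # keep whichever candidate is closer in time to (t - 1 day)
    use_left = np.abs(ts[left] - target) < np.abs(ts[right] - target)
    ind = np.where(use_left, left, right)

    df['lag_nearest'] = df['values'].values[ind]
    return df

add_lag_nearest(df)

Note that with this variant every row has some nearest neighbour, so no NaNs are produced; the earliest rows simply pick up the first available value rather than a true previous-day value.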
Upvotes: 1