Reputation: 2542
I have a large dataframe of the form
timestamp | col1 | col2 ...
I want to select rows spaced out by an interval of at least x minutes, where x can be 5, 10, 30, etc. The problem is that the timestamps aren't equally spaced, so I can't do a simple "take every nth row" trick.
Example:
timestamp | col1 | col2
'2019-01-15 17:52:29.955000', x, b
'2019-01-15 17:58:29.531000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:27:46.324000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:28:27.406000', x, b
'2019-01-16 07:34:35.194000', x, b
if interval = 10:
result:
'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:34:35.194000', x, b
if interval = 30:
result:
'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 07:22:08.170000', x, b
I could do a brute-force n^2 approach, but I'm sure there's a pandas way to do this that I'm missing.
Thank you! :)
EDIT: To clarify, this is not a duplicate of "Calculate time difference between Pandas Dataframe indices". I need to subset a dataframe based on a given interval.
Upvotes: 4
Views: 1038
Reputation: 150735
As noted in the comments, it looks like you need a for loop. That is not too bad, because it is a single O(n) pass over the rows:
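import pandas as pd

def sampling(df, thresh):
    # Minimum gap between kept rows, e.g. '10min'
    thresh = pd.to_timedelta(thresh)
    # Gap to the previous row; the first row has no predecessor, so use 0
    time_diff = df.timestamp.diff().fillna(pd.Timedelta(seconds=0))
    ret = [df.index[0]]  # always keep the first row
    running_total = pd.to_timedelta(0)
    for i in df.index:
        # Time elapsed since the last kept row
        running_total += time_diff[i]
        if running_total >= thresh:
            ret.append(i)
            running_total = pd.to_timedelta(0)
    return df.loc[ret].copy()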
Then sampling(df, '10min') gives:
timestamp col1 col2
0 2019-01-15 17:52:29.955 x b
2 2019-01-16 03:21:48.255 x b
4 2019-01-16 03:33:09.984 x b
5 2019-01-16 07:22:08.170 x b
7 2019-01-16 07:34:35.194 x b
and sampling(df, '30min') gives:
timestamp col1 col2
0 2019-01-15 17:52:29.955 x b
2 2019-01-16 03:21:48.255 x b
5 2019-01-16 07:22:08.170 x b
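For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same greedy selection on the sample data from the question. It is a variant, not the function above: the name sample_min_gap and its minutes and col parameters are made up for illustration, and it assumes the timestamp column is timezone-naive datetime64 and already sorted ascending. It compares each timestamp directly with the last kept one (equivalent to the running total of diffs) and iterates over plain integers, which tends to be faster on large frames than indexing a Series per row.

```python
import pandas as pd


def sample_min_gap(df, minutes, col="timestamp"):
    """Keep the first row, then each row at least `minutes` after the last kept row."""
    # Assumption: df[col] is tz-naive datetime64 and sorted ascending.
    # Convert to integer nanoseconds since the epoch for cheap comparisons.
    ts = df[col].to_numpy().astype("datetime64[ns]").astype("int64")
    gap = int(minutes * 60 * 1_000_000_000)  # threshold in nanoseconds
    keep = []    # positional indices of the rows to keep
    last = None  # timestamp (ns) of the most recently kept row
    for pos, t in enumerate(ts):
        if last is None or t - last >= gap:
            keep.append(pos)
            last = t
    return df.iloc[keep].copy()


# Sample data from the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-15 17:52:29.955000",
        "2019-01-15 17:58:29.531000",
        "2019-01-16 03:21:48.255000",
        "2019-01-16 03:27:46.324000",
        "2019-01-16 03:33:09.984000",
        "2019-01-16 07:22:08.170000",
        "2019-01-16 07:28:27.406000",
        "2019-01-16 07:34:35.194000",
    ]),
    "col1": "x",
    "col2": "b",
})

print(sample_min_gap(df, 10).index.tolist())  # [0, 2, 4, 5, 7]
print(sample_min_gap(df, 30).index.tolist())  # [0, 2, 5]
```

The loop is still O(n). Each decision depends on which row was kept last, which is why a plain diff() or resample() does not directly express this rule.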
Upvotes: 5