Wboy

Reputation: 2542

pandas select rows with given timestamp interval

I have a large dataframe of the form

timestamp | col1 | col2 ...

I want to select rows spaced out by an interval of at least x minutes, where x can be 5, 10, 30, etc. The problem is that the timestamps aren't equally spaced, so I can't do a simple "take every nth row" trick.

Example:

timestamp | col1 | col2

'2019-01-15 17:52:29.955000', x, b
'2019-01-15 17:58:29.531000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:27:46.324000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:28:27.406000', x, b
'2019-01-16 07:34:35.194000', x, b

if interval = 10:

result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 03:33:09.984000', x, b
'2019-01-16 07:22:08.170000', x, b
'2019-01-16 07:34:35.194000', x, b

if interval = 30:

result:

'2019-01-15 17:52:29.955000', x, b
'2019-01-16 03:21:48.255000', x, b
'2019-01-16 07:22:08.170000', x, b

I could do a brute-force O(n^2) approach, but I'm sure there's a pandas way to do this that I'm missing.

Thank you! :)

EDIT: To clarify, this is not a duplicate of "Calculate time difference between Pandas Dataframe indices". I need to subset a dataframe based on a given interval.

Upvotes: 4

Views: 1038

Answers (1)

Quang Hoang

Reputation: 150735

As noted in the comments, it looks like you need a for loop. That is not too bad, because it is a single O(n) pass over the rows:

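import pandas as pd

def sampling(df, thresh):
    # thresh can be anything pd.to_timedelta accepts, e.g. '10T'
    thresh = pd.to_timedelta(thresh)
    # gap to the previous row; the first row has no predecessor, so use 0
    time_diff = df.timestamp.diff().fillna(pd.Timedelta(seconds=0))
    # always keep the first row (by label, so a non-default index also works)
    ret = [df.index[0]]
    running_total = pd.to_timedelta(0)
    for i in df.index:
        running_total += time_diff[i]
        # enough time has passed since the last kept row: keep this one
        if running_total >= thresh:
            ret.append(i)
            running_total = pd.to_timedelta(0)

    return df.loc[ret].copy()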

Then sampling(df, '10T') gives:

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
4 2019-01-16 03:33:09.984    x    b
5 2019-01-16 07:22:08.170    x    b
7 2019-01-16 07:34:35.194    x    b

and sampling(df, '30T') gives:

                timestamp col1 col2
0 2019-01-15 17:52:29.955    x    b
2 2019-01-16 03:21:48.255    x    b
5 2019-01-16 07:22:08.170    x    b
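
If you want to reproduce this, here is a minimal self-contained sketch on the sample data from the question. It is a variant of the loop above, not the same function: it remembers the timestamp of the last kept row instead of accumulating diffs, which should select the same rows as long as the frame is sorted by timestamp. The name sample_min_gap and its col parameter are made up for illustration. It uses '10min' rather than '10T' because, as far as I know, recent pandas versions warn about the 'T' alias.

```python
import pandas as pd


def sample_min_gap(df, thresh, col="timestamp"):
    """Keep the first row, then each row at least `thresh` after the last kept row.

    Hypothetical helper for illustration. Assumes `df` is sorted by `col`
    and that `col` holds datetime64 values.
    """
    thresh = pd.to_timedelta(thresh)
    keep = []         # positions of the rows to keep
    last_kept = None  # timestamp of the most recently kept row
    for pos, ts in enumerate(df[col]):
        if last_kept is None or ts - last_kept >= thresh:
            keep.append(pos)
            last_kept = ts
    return df.iloc[keep].copy()


# Sample data from the question
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-15 17:52:29.955000",
        "2019-01-15 17:58:29.531000",
        "2019-01-16 03:21:48.255000",
        "2019-01-16 03:27:46.324000",
        "2019-01-16 03:33:09.984000",
        "2019-01-16 07:22:08.170000",
        "2019-01-16 07:28:27.406000",
        "2019-01-16 07:34:35.194000",
    ]),
    "col1": "x",
    "col2": "b",
})

print(sample_min_gap(df, "10min").index.tolist())  # [0, 2, 4, 5, 7]
print(sample_min_gap(df, "30min").index.tolist())  # [0, 2, 5]
```

If the timestamp column is still strings, convert it with pd.to_datetime and sort by it first. Otherwise the subtraction or the ordering assumption will not hold.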

Upvotes: 5
