Rahul

Reputation: 416

Pandas dataframe rolling difference in value for 5 second intervals per group

I have a Pandas dataframe containing a timestamp (unevenly spaced), a sequence number, a category and a percent formation. The sequence number is used only to order rows when multiple rows exist with the same timestamp and category, and it is dropped after sorting.

|----------------------------------------------------------------|
|                        | seq_no   | category   | pct_formation |
|----------------------------------------------------------------|
|ts_timestamp            |          |            |               |
|----------------------------------------------------------------|
|2018-10-22 10:13:44.043 | 6839262  | in_petr    | 37.070000     |
|2018-10-22 10:17:09.527 | 7257908  | in_petr    | 36.970000     |
|2018-10-22 10:17:43.977 | 7319000  | in_dsh     | 36.950000     |
|2018-10-22 10:17:43.963 | 7318885  | in_dsh     | 36.960000     |
|2018-10-22 10:17:09.527 | 7257918  | in_petr    | 32.960000     |
|2018-10-22 10:19:44.040 | 7585354  | out_petr   | 36.890000     |
|2018-10-22 10:19:44.043 | 7585461  | out_petr   | 36.900000     |
|2018-10-22 10:19:37.267 | 7563817  | sync       | 33.910000     |
|2018-10-22 10:19:44.057 | 7586045  | sync       | 36.960000     |
|2018-10-22 10:19:16.750 | 7516841  | out_petr   | 36.880000     |
|2018-10-22 10:20:03.160 | 7637889  | sync       | 36.980000     |
|2018-10-22 10:20:32.350 | 7691592  | sync       | 37.000000     |
|2018-10-22 10:23:03.150 | 8008804  | sync       | 34.580000     |
|2018-10-22 10:22:18.633 | 7907782  | in_dsh     | 36.980000     |
|2018-10-22 10:25:39.557 | 8290932  | in_dsh     | 36.970000     |
|----------------------------------------------------------------|
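For anyone who wants to reproduce this, the first few rows can be rebuilt like so (a subset of the table above; seq_no only breaks ties and is then dropped, as described):

import pandas as pd

df = pd.DataFrame({
    'ts_timestamp': ['2018-10-22 10:13:44.043', '2018-10-22 10:17:09.527',
                     '2018-10-22 10:17:43.977', '2018-10-22 10:17:43.963',
                     '2018-10-22 10:17:09.527', '2018-10-22 10:19:44.040'],
    'seq_no': [6839262, 7257908, 7319000, 7318885, 7257918, 7585354],
    'category': ['in_petr', 'in_petr', 'in_dsh', 'in_dsh', 'in_petr', 'out_petr'],
    'pct_formation': [37.07, 36.97, 36.95, 36.96, 32.96, 36.89],
})
df['ts_timestamp'] = pd.to_datetime(df['ts_timestamp'])
df = (df.sort_values(['ts_timestamp', 'seq_no'])
        .drop(columns='seq_no')
        .set_index('ts_timestamp'))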

I want to get change in pct_formation for each category every five seconds between 10:00:00 and 11:00:00 each day.

So far, I have tried:

df.sort_index()[['category', 'pct_formation']] \
  .groupby(['category', df.index.date]) \
  .rolling('5s').pct_formation.mean()

I group by date because I suspect that grouping on the raw timestamps would not make much sense: they are unevenly spaced and very granular.

How can I get evenly spaced 5-second windows between 10:00:00 and 11:00:00 (for example, 10:00:00 to 10:00:05, 10:00:01 to 10:00:06, and so on)? And how do I get the difference in pct_formation between the start and end of each 5-second window?

If I use functions like min() and max() after rolling(), I get errors like:

ValueError: could not convert string to float: 'out_petr'
TypeError: cannot handle this type -> object

Please guide me on how to proceed; I'd be very grateful. TIA.

EDIT: Adding details based on feedback in comments.

I want a rolling window, so the next window after 10:00:00 to 10:00:05 would be 10:00:01 to 10:00:06, followed by 10:00:02 to 10:00:07, and so on.

I would like to see how much the pct_formation value has changed from one window to another, so if there are multiple values in the same interval, I will use mean().

I think I will have to use .resample() to get evenly spaced intervals between 10am and 11am every day, but I am finding it very difficult to understand how.

I realized I can create regularly spaced time windows like:

pd.date_range(start=df.index.min().replace(hour=10, minute=0, second=0, microsecond=0),
              end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
              freq='5S')

However, I do not know how to change my dataframe to conform to these times for each category.
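Something like the following is my best guess (untested; grid and snapped are just placeholder names), but I am not sure it is right:

# untested idea: collapse duplicate stamps with mean(), then align each
# category onto the regular 5-second grid, forward-filling the gaps
grid = pd.date_range(start=df.index.min().replace(hour=10, minute=0, second=0, microsecond=0),
                     end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
                     freq='5S')

snapped = (df.groupby('category')['pct_formation']
             .apply(lambda s: s.groupby(level=0).mean()
                               .reindex(grid, method='ffill')))

Even if that works, it only steps every 5 seconds, whereas I want overlapping windows that step every second, so I suspect I need a 1-second grid plus a 5-point rolling difference.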

Upvotes: 1

Views: 3601

Answers (1)

rahlf23

Reputation: 9019

IIUC, you can use resample() and rolling():

# ts_timestamp is the question's index, so make it a column first; the
# %f is needed for the millisecond part to parse
df = df.reset_index()
df['ts_timestamp'] = pd.to_datetime(df['ts_timestamp'], format='%Y-%m-%d %H:%M:%S.%f')

# per category: drop duplicate stamps, index by time, upsample to a
# regular 1-second grid, and forward-fill the sparse values
resampled = df.groupby('category').apply(
    lambda x: x.drop_duplicates('ts_timestamp').set_index('ts_timestamp').resample('1S').ffill())

# roll within each category so the 5-point windows never cross group
# boundaries; x[0] - x[-1] is start minus end (negate for end minus start)
resampled.groupby(level='category', group_keys=False)['pct_formation'].apply(
    lambda s: s.rolling(5).apply(lambda x: x[0] - x[-1], raw=True))

Yields (a short sample):

category  ts_timestamp       
in_dsh    2018-10-22 10:17:43    NaN
          2018-10-22 10:17:44    NaN
          2018-10-22 10:17:45    NaN
          2018-10-22 10:17:46    NaN
          2018-10-22 10:17:47    NaN
          2018-10-22 10:17:48    0.0
          2018-10-22 10:17:49    0.0
          2018-10-22 10:17:50    0.0
          2018-10-22 10:17:51    0.0
          2018-10-22 10:17:52    0.0
          2018-10-22 10:17:53    0.0
          2018-10-22 10:17:54    0.0
          2018-10-22 10:17:55    0.0
...

For the time being, I am simply using ffill() to fill in the relatively sparse data, but you could also think about interpolating, etc.
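Building on that, here is a sketch for the 10:00:00 to 11:00:00 restriction the question asks about (assuming the datetime level of the MultiIndex kept the name ts_timestamp; diffs and within_hour are placeholder names). DatetimeIndex.indexer_between_time() does the time-of-day filtering:

# same rolling difference as above, assigned so it can be filtered
diffs = resampled.groupby(level='category', group_keys=False)['pct_formation'].apply(
    lambda s: s.rolling(5).apply(lambda x: x[0] - x[-1], raw=True))

# keep only rows whose time of day falls between 10:00:00 and 11:00:00
ts = diffs.index.get_level_values('ts_timestamp')
within_hour = diffs.iloc[ts.indexer_between_time('10:00:00', '11:00:00')]

And if you do want interpolation instead of the forward fill, swapping ffill() for interpolate() after resample('1S') is the drop-in change.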

Upvotes: 3
