Reputation: 416
I have a pandas DataFrame containing a timestamp (unevenly spaced), a sequence number, a category and a percent formation. The sequence number is used only to order rows that share the same timestamp and category, and is dropped after sorting.
                          seq_no  category  pct_formation
ts_timestamp
2018-10-22 10:13:44.043  6839262   in_petr      37.070000
2018-10-22 10:17:09.527  7257908   in_petr      36.970000
2018-10-22 10:17:43.977  7319000    in_dsh      36.950000
2018-10-22 10:17:43.963  7318885    in_dsh      36.960000
2018-10-22 10:17:09.527  7257918   in_petr      32.960000
2018-10-22 10:19:44.040  7585354  out_petr      36.890000
2018-10-22 10:19:44.043  7585461  out_petr      36.900000
2018-10-22 10:19:37.267  7563817      sync      33.910000
2018-10-22 10:19:44.057  7586045      sync      36.960000
2018-10-22 10:19:16.750  7516841  out_petr      36.880000
2018-10-22 10:20:03.160  7637889      sync      36.980000
2018-10-22 10:20:32.350  7691592      sync      37.000000
2018-10-22 10:23:03.150  8008804      sync      34.580000
2018-10-22 10:22:18.633  7907782    in_dsh      36.980000
2018-10-22 10:25:39.557  8290932    in_dsh      36.970000
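(For a reproducible setup, here is a minimal sketch that rebuilds the first few rows with the same columns and a parsed DatetimeIndex:)

import pandas as pd

df = pd.DataFrame({
    'ts_timestamp': ['2018-10-22 10:13:44.043', '2018-10-22 10:17:09.527',
                     '2018-10-22 10:17:43.977', '2018-10-22 10:17:43.963'],
    'seq_no': [6839262, 7257908, 7319000, 7318885],
    'category': ['in_petr', 'in_petr', 'in_dsh', 'in_dsh'],
    'pct_formation': [37.07, 36.97, 36.95, 36.96],
})
# parse the timestamps and use them as the index, as in the table above
df['ts_timestamp'] = pd.to_datetime(df['ts_timestamp'])
df = df.set_index('ts_timestamp')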
I want to get the change in pct_formation for each category over every five-second window between 10:00:00 and 11:00:00 each day.
So far, I have tried:
(df.sort_index()[['category', 'pct_formation']]
   .groupby(['category', df.index.date])
   .rolling('5s').pct_formation.mean())
I group by date because I suspect that grouping on the raw timestamps won't make much sense, as they are unevenly spaced and very granular.
How can I get evenly spaced 5-second windows between 10:00:00 and 11:00:00 (for example: 10:00:00 to 10:00:05, 10:00:01 to 10:00:06, and so on)? And how do I get the difference in pct_formation between the start and end of each 5-second window?
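(Side note: the 10:00:00 to 11:00:00 restriction itself can be done with between_time, which slices a DatetimeIndex by time of day regardless of the date. A minimal sketch, assuming ts_timestamp is the index:)

# keep only rows whose time of day falls between 10:00:00 and 11:00:00, on any date
morning = df.between_time('10:00:00', '11:00:00')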
If I use functions like min() and max() after rolling(), I get errors like:
ValueError: could not convert string to float: 'out_petr'
TypeError: cannot handle this type -> object
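(Presumably these come from rolling over the object-dtype category column. A minimal sketch that selects the numeric column before rolling, based on my grouping attempt above:)

dfs = df.sort_index()
# selecting the numeric column first keeps the object-dtype 'category'
# column out of the rolling aggregation, which is what min()/max() choke on
dfs.groupby(['category', dfs.index.date])['pct_formation'].rolling('5s').max()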
Please guide me on how to proceed; I'd be very grateful. TIA.
EDIT: Adding details based on feedback in comments.
I want a rolling window, so the next window after 10:00:00 to 10:00:05 would be 10:00:01 to 10:00:06, followed by 10:00:02 to 10:00:07, and so on.
I would like to see how much the pct_formation value has changed from one window to the next; if there are multiple values in the same interval, I will take their mean().
I think I will have to use .resample() to get evenly spaced intervals between 10am and 11am every day, but I am finding it very difficult to understand how.
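(For the mean-within-an-interval part, a minimal sketch, assuming a DatetimeIndex: resample averages every observation that lands in the same 1-second bucket, per category; empty buckets come out as NaN:)

# average duplicate observations inside each 1-second bucket, per category
per_second = df.groupby('category')['pct_formation'].resample('1S').mean()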
I realized I can create regularly spaced time windows like:
pd.date_range(start=df.index.min().replace(hour=10, minute=0, second=0, microsecond=0),
              end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
              freq='5S')
However, I do not know how to change my dataframe to conform to these times for each category.
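(One hedged way to conform each category to that grid is a per-group reindex with forward-fill; grid below is the date_range from above, and the final diff gives the change per 5-second step rather than per rolling second:)

grid = pd.date_range(start=df.index.min().replace(hour=10, minute=0, second=0, microsecond=0),
                     end=df.index.max().replace(hour=11, minute=0, second=0, microsecond=0),
                     freq='5S')
# per category: average duplicate stamps, snap onto the 5-second grid
# carrying the last known pct_formation forward, then diff step over step
conformed = (df.groupby('category')['pct_formation']
               .apply(lambda s: s.groupby(level=0).mean().reindex(grid, method='ffill')))
changes = conformed.groupby(level='category').diff()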
Upvotes: 1
Views: 3601
Reputation: 9019
IIUC, you can use resample() and rolling():
df['ts_timestamp'] = pd.to_datetime(df['ts_timestamp'], format='%Y-%m-%d %H:%M:%S.%f')  # the stamps carry milliseconds
resampled = df.groupby('category').apply(
    lambda x: x.drop_duplicates('ts_timestamp').set_index('ts_timestamp').resample('1S').ffill())
# first minus last value in each 5-second window, grouped so windows never span categories
resampled['pct_formation'].groupby(level='category', group_keys=False).apply(
    lambda s: s.rolling(5).apply(lambda x: x[0] - x[-1], raw=True))
Yields (a short sample):
category ts_timestamp
in_dsh 2018-10-22 10:17:43 NaN
2018-10-22 10:17:44 NaN
2018-10-22 10:17:45 NaN
2018-10-22 10:17:46 NaN
2018-10-22 10:17:47 NaN
2018-10-22 10:17:48 0.0
2018-10-22 10:17:49 0.0
2018-10-22 10:17:50 0.0
2018-10-22 10:17:51 0.0
2018-10-22 10:17:52 0.0
2018-10-22 10:17:53 0.0
2018-10-22 10:17:54 0.0
2018-10-22 10:17:55 0.0
...
For the time being, I am simply using ffill() to fill in the relatively sparse data, but you could also think about interpolating, etc.
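(If you go the interpolation route instead, a minimal sketch on the numeric column only; method='time' weights by the actual gaps between observations, and the same rolling(5) difference applies afterwards:)

# time-weighted interpolation on the 1-second grid instead of forward-fill
interp = df.groupby('category').apply(
    lambda x: (x.drop_duplicates('ts_timestamp')
                .set_index('ts_timestamp')['pct_formation']
                .resample('1S')
                .interpolate(method='time')))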
Upvotes: 3