jBrewing
jBrewing

Reputation: 21

How to resample/reindex/groupby a time series based on a column's data?

SO I've got a pandas data frame that contains 2 values of water use at a 1 second resolution. The values are "hotIn" and "hotOut". The hotIn can record down to the tenth of a gallon at a one second resolution while the hotOut records whole number pulses representing a gallon, i.e. when a pulse occurs, one gallon has passed through the meter. The pulses occur roughly every 14-15 seconds.

Data looks roughly like this:

Index                       hotIn(gpm)      hotOut(pulse=1gal) 
2019-03-23T00:00:00         4               0 
2019-03-23T00:00:01         5               0 
2019-03-23T00:00:02         4               0 
2019-03-23T00:00:03         4               0 
2019-03-23T00:00:04         3               0 
2019-03-23T00:00:05         4               1 
2019-03-23T00:00:06         4               0 
2019-03-23T00:00:07         5               0 
2019-03-23T00:00:08         3               0 
2019-03-23T00:00:09         3               0 
2019-03-23T00:00:10         4               0 
2019-03-23T00:00:11         4               0 
2019-03-23T00:00:12         5               0 
2019-03-23T00:00:13         5               1

What I'm trying to do is resample or reindex the data frame based on the occurrence of pulses and sum the hotIn between the new timestamps.

For example, sum the hotIn between 00:00:00 - 00:00:05 and 00:00:06 - 00:00:13.

Results would ideally look like this:

Index                       hotIn sum(gpm)      hotOut(pulse=1gal) 
2019-03-23T00:00:05         24                 1 
2019-03-23T00:00:13         32                 1 

I've explored using a two step for-elif loop that just checks if the hotOut == 1, it works but its painfully slow on large datasets. I'm positive the timestamp functionality of Pandas will be superior if this is possible. I also can't simply resample on a set frequency because the interval between pulses changes periodically. The primary issue is the period of timestamps between pulses changes so a general resample rule would not work. I've also run into problems with matching data frame lengths when pulling out the timestamps associated with pulses and applying them to the main as a new index.

Upvotes: 2

Views: 83

Answers (2)

Quang Hoang
Quang Hoang

Reputation: 150745

IIUC, you can do:

s = df['hotOut(pulse=1gal)'].shift().ne(0).cumsum()
(df.groupby(s)
   .agg({'Index':'last', 'hotIn(gpm)':'sum'})
   .reset_index(drop=True)
)

Output:

                 Index  hotIn(gpm)
0  2019-03-23T00:00:05          24
1  2019-03-23T00:00:13          33

Upvotes: 1

ALollz
ALollz

Reputation: 59549

You don't want to group on the Index. You want to group whenever 'hotOut(pulse=1gal)' changes.

s = df['hotOut(pulse=1gal)'].cumsum().shift().bfill()

(df.reset_index()
   .groupby(s, as_index=False)
   .agg({'Index': 'last', 'hotIn(gpm)': 'sum', 'hotOut(pulse=1gal)': 'last'})
   .set_index('Index'))

                     hotIn(gpm)  hotOut(pulse=1gal)
Index                                              
2019-03-23T00:00:05          24                   1
2019-03-23T00:00:13          33                   1

Upvotes: 0

Related Questions