Reputation: 3535
I have the following dataframe:
dt binary
2016-01-01 00:00:00 False
2016-01-01 00:00:01 False
2016-01-01 00:00:02 False
2016-01-01 00:00:03 False
2016-01-01 00:00:04 True
2016-01-01 00:00:05 True
2016-01-01 00:00:06 True
2016-01-01 00:00:07 False
2016-01-01 00:00:08 False
2016-01-01 00:00:09 True
2016-01-01 00:00:10 True
I would like to sum the elapsed time when binary
is True
. I'm sharing my solution, which implements it, but something tells me there should be an easier way since it is a pretty basic feature of time series data. Note that the data is most probably equidistant, but I can't rely on that.
df['binary_grp'] = (df.binary.diff(1) != False).astype(int).cumsum()
# Throw away False values
df = df[df.binary]
groupby = df.groupby('binary_grp')
df = pd.DataFrame({'timespan': groupby.dt.last() - groupby.dt.first()})
return df.timespan.sum().seconds / 60.0
The trickiest part is probably the first line. What it does, it basically assigns an incremented number to each consecutive block. Here's how the data looks like after that:
dt binary binary_grp
2016-01-01 00:00:00 False 1
2016-01-01 00:00:01 False 1
2016-01-01 00:00:02 False 1
2016-01-01 00:00:03 False 1
2016-01-01 00:00:04 True 2
2016-01-01 00:00:05 True 2
2016-01-01 00:00:06 True 2
2016-01-01 00:00:07 False 3
2016-01-01 00:00:08 False 3
2016-01-01 00:00:09 True 4
2016-01-01 00:00:10 True 4
Is there a better way to accomplish this? I guess this code is performant, my worry is readability.
Upvotes: 2
Views: 715
Reputation: 294328
IIUC:
You want to find the sum of time spanned across the entire series where binary
is True
.
However, we have to make some choices or assumptions
dt binary
0 2016-01-01 00:00:00 False
1 2016-01-01 00:00:01 False
2 2016-01-01 00:00:02 False
3 2016-01-01 00:00:03 False
4 2016-01-01 00:00:04 True # <- This where time starts
5 2016-01-01 00:00:05 True
6 2016-01-01 00:00:06 True
7 2016-01-01 00:00:07 False # <- And ends here. So this would
8 2016-01-01 00:00:08 False # be 00:00:07 - 00:00:04 or 3 seconds
9 2016-01-01 00:00:09 True # <- Starts again
10 2016-01-01 00:00:10 True # <- But ends here because
# I don't have another Timestamp
With those assumptions, we can use diff
, multiply, and sum
df.dt.diff().shift(-1).mul(df.binary).sum()
Timedelta('0 days 00:00:04')
We can then use this concept along with groupby
# Use xor and cumsum to identify change in True to False and False to True
grps = (df.binary ^ df.binary.shift()).cumsum()
mask = df.binary.groupby(grps).first()
df.dt.diff().shift(-1).groupby(grps).sum()[mask]
binary
1 00:00:03
3 00:00:01
Name: dt, dtype: timedelta64[ns]
Or without the mask
pd.concat([df.dt.diff().shift(-1).groupby(grps).sum(), mask], axis=1)
dt binary
binary
0 00:00:04 False
1 00:00:03 True
2 00:00:02 False
3 00:00:01 True
Upvotes: 2
Reputation: 862761
In my opinion your solution is nice.
Another solution:
Compare shift
ed values with ne
, get groups by cumsum
.
After filtering is possible use apply
with difference by selecting with iloc
:
df['binary_grp'] = (df.binary.ne(df.binary.shift())).cumsum()
df = df[df.binary]
s = df.groupby('binary_grp')['dt'].apply(lambda x: x.iloc[-1] - x.iloc[0])
print (s)
binary_grp
2 00:00:02
4 00:00:01
Name: dt, dtype: timedelta64[ns]
all_time = s.sum().seconds / 60.0
print (all_time)
0.05
In your solution new DataFrame
is not necessary if need only all_time
:
groupby = df.groupby('binary_grp')
s = groupby.dt.last() - groupby.dt.first()
all_time = s.sum().seconds / 60.0
print (all_time)
0.05
But if need it is possible create it from Series
s
by to_frame
:
df1 = s.to_frame('timestamp')
print (df1)
timestamp
binary_grp
2 00:00:02
4 00:00:01
Upvotes: 2