fodma1
fodma1

Reputation: 3535

Pandas measure elapsed time when condition is true

I have the following dataframe:

                 dt binary
2016-01-01 00:00:00  False
2016-01-01 00:00:01  False
2016-01-01 00:00:02  False
2016-01-01 00:00:03  False
2016-01-01 00:00:04   True
2016-01-01 00:00:05   True
2016-01-01 00:00:06   True
2016-01-01 00:00:07  False
2016-01-01 00:00:08  False
2016-01-01 00:00:09   True
2016-01-01 00:00:10   True

I would like to sum the elapsed time when binary is True. I'm sharing my solution, which implements it, but something tells me there should be an easier way since it is a pretty basic feature of time series data. Note that the data is most probably equidistant, but I can't rely on that.

df['binary_grp'] = (df.binary.diff(1) != False).astype(int).cumsum()
# Throw away False values
df = df[df.binary]
groupby = df.groupby('binary_grp')
df = pd.DataFrame({'timespan': groupby.dt.last() - groupby.dt.first()})
return df.timespan.sum().seconds / 60.0

The trickiest part is probably the first line. What it does, it basically assigns an incremented number to each consecutive block. Here's how the data looks like after that:

                 dt binary  binary_grp
2016-01-01 00:00:00  False           1
2016-01-01 00:00:01  False           1
2016-01-01 00:00:02  False           1
2016-01-01 00:00:03  False           1
2016-01-01 00:00:04   True           2
2016-01-01 00:00:05   True           2
2016-01-01 00:00:06   True           2
2016-01-01 00:00:07  False           3
2016-01-01 00:00:08  False           3
2016-01-01 00:00:09   True           4
2016-01-01 00:00:10   True           4

Is there a better way to accomplish this? I guess this code is performant, my worry is readability.

Upvotes: 2

Views: 715

Answers (2)

piRSquared
piRSquared

Reputation: 294328

IIUC:

You want to find the sum of time spanned across the entire series where binary is True.

However, we have to make some choices or assumptions

                    dt  binary
0  2016-01-01 00:00:00   False
1  2016-01-01 00:00:01   False
2  2016-01-01 00:00:02   False
3  2016-01-01 00:00:03   False
4  2016-01-01 00:00:04    True # <- This where time starts
5  2016-01-01 00:00:05    True
6  2016-01-01 00:00:06    True
7  2016-01-01 00:00:07   False # <- And ends here. So this would
8  2016-01-01 00:00:08   False # be 00:00:07 - 00:00:04 or 3 seconds
9  2016-01-01 00:00:09    True # <- Starts again
10 2016-01-01 00:00:10    True # <- But ends here because
                               # I don't have another Timestamp

With those assumptions, we can use diff, multiply, and sum

df.dt.diff().shift(-1).mul(df.binary).sum()

Timedelta('0 days 00:00:04')

We can then use this concept along with groupby

# Use xor and cumsum to identify change in True to False and False to True
grps = (df.binary ^ df.binary.shift()).cumsum()
mask = df.binary.groupby(grps).first()
df.dt.diff().shift(-1).groupby(grps).sum()[mask]

binary
1   00:00:03
3   00:00:01
Name: dt, dtype: timedelta64[ns]

Or without the mask

pd.concat([df.dt.diff().shift(-1).groupby(grps).sum(), mask], axis=1)

             dt  binary
binary                 
0      00:00:04   False
1      00:00:03    True
2      00:00:02   False
3      00:00:01    True

Upvotes: 2

jezrael
jezrael

Reputation: 862761

In my opinion your solution is nice.

Another solution:

Compare shifted values with ne, get groups by cumsum.

After filtering is possible use apply with difference by selecting with iloc:

df['binary_grp'] = (df.binary.ne(df.binary.shift())).cumsum()

df = df[df.binary]

s = df.groupby('binary_grp')['dt'].apply(lambda x: x.iloc[-1] - x.iloc[0])
print (s)
binary_grp
2   00:00:02
4   00:00:01
Name: dt, dtype: timedelta64[ns]

all_time =  s.sum().seconds / 60.0
print (all_time)
0.05

In your solution new DataFrame is not necessary if need only all_time:

groupby = df.groupby('binary_grp')

s = groupby.dt.last() - groupby.dt.first()
all_time =  s.sum().seconds / 60.0
print (all_time)
0.05

But if need it is possible create it from Series s by to_frame:

df1 = s.to_frame('timestamp')
print (df1)
           timestamp
binary_grp          
2           00:00:02
4           00:00:01

Upvotes: 2

Related Questions