Reputation: 149
I have a boolean, time-based data set, as in the example below. I want to highlight continuous sequences of more than three 1's in the data set, and capture this in a new column called [Continuous_out_x]. Is there an efficient operation to do this?
I generated test data in this way:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    zip(np.random.randint(2, size=20), np.random.randint(2, size=20)),
    columns=['tag1', 'tag2'],
    index=pd.date_range('2020-01-01', periods=20, freq='s'),
)
The output I got was the following:
print (df):
tag1 tag2
2020-01-01 00:00:00 0 0
2020-01-01 00:00:01 1 0
2020-01-01 00:00:02 1 0
2020-01-01 00:00:03 1 1
2020-01-01 00:00:04 1 0
2020-01-01 00:00:05 1 0
2020-01-01 00:00:06 1 1
2020-01-01 00:00:07 0 1
2020-01-01 00:00:08 0 0
2020-01-01 00:00:09 1 1
2020-01-01 00:00:10 1 0
2020-01-01 00:00:11 0 1
2020-01-01 00:00:12 1 0
2020-01-01 00:00:13 0 1
2020-01-01 00:00:14 0 1
2020-01-01 00:00:15 0 1
2020-01-01 00:00:16 1 1
2020-01-01 00:00:17 0 0
2020-01-01 00:00:18 0 1
2020-01-01 00:00:19 1 0
A solution to this example set (above) would look like this:
print(df):
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
Upvotes: 1
Views: 196
Reputation: 5183
You can label the regions of contiguous equal values, then check whether each region's length exceeds your cutoff.
for colname, series in df.items():
    new = f'Continuous_{colname}'
    df[new] = series.diff().ne(0).cumsum()               # label contiguous regions
    df[new] = series.groupby(df[new]).transform('size')  # get size of each region
    df[new] = df[new].gt(3) * series                     # keep only runs of 1's longer than 3
Output
tag1 tag2 Continuous_tag1 Continuous_tag2
index
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
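The same idea can be wrapped in a small standalone helper. This is a sketch: the function name `flag_long_runs` and the `min_length` parameter are my own naming, not from the original code ("more than three" 1's means a run length of at least 4):

```python
import pandas as pd

def flag_long_runs(series: pd.Series, min_length: int = 4) -> pd.Series:
    """Return 1 where `series` sits inside a run of 1's of at least `min_length`."""
    run_id = series.diff().ne(0).cumsum()                # label each contiguous run
    run_size = series.groupby(run_id).transform('size')  # length of the run each row is in
    # mask with the series itself so long runs of 0's are not flagged
    return ((run_size >= min_length) & series.eq(1)).astype(int)

s = pd.Series([0, 1, 1, 1, 1, 0, 1, 1, 0])
print(flag_long_runs(s).tolist())  # → [0, 1, 1, 1, 1, 0, 0, 0, 0]
```

Applying it per column (`df[f'Continuous_{col}'] = flag_long_runs(df[col])`) reproduces the output above.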
Upvotes: 2
Reputation: 11171
You can do this as:
code
# ok to loop over a few columns, still very performant
for col in ["tag1", "tag2"]:
    col_no = col[-1]
    # new run label each time the value changes
    df[f"group_{col}"] = np.cumsum(df[col].shift(1) != df[col])
    # size each run of the current column (not a hard-coded tag1), and mask with
    # the column itself so long runs of 0's are not flagged
    df[f"{col}_counts"] = (
        df.groupby(f"group_{col}")[col].transform("count") > 3
    ) & df[col].astype(bool)
    df[f"Continuous_out_{col_no}"] = df[f"{col}_counts"].astype(int)
    df = df.drop(columns=[f"group_{col}", f"{col}_counts"])
output
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
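To see why the `group_{col}` labels isolate runs: every time the value changes, comparing against the shifted series yields True, so the cumulative sum ticks up and each contiguous run gets its own label. A minimal illustration on a hand-made series:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, 0, 1])
groups = np.cumsum(s.shift(1) != s)  # new label every time the value changes
print(groups.tolist())  # → [1, 1, 2, 2, 2, 3, 4]
```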
Upvotes: 3