hamslice
hamslice

Reputation: 149

Identify continuous sequences or groups of boolean data in Pandas

I have a boolean time based data set. As per the example below. I am interested in highlighting continuous sequences of more than three 1's in the data set. I would like to capture this in a new column called [Continuous_out_x]. Is there any efficient operation to do this?

I generated test data in this way:

df = pd.DataFrame(zip(list(np.random.randint(2, size=20)),list(np.random.randint(2, size=20))), columns=['tag1','tag2'] ,index=pd.date_range('2020-01-01', periods=20, freq='s'))

The output I got was the following:

print (df):
                        tag1  tag2
2020-01-01 00:00:00     0     0
2020-01-01 00:00:01     1     0
2020-01-01 00:00:02     1     0
2020-01-01 00:00:03     1     1
2020-01-01 00:00:04     1     0
2020-01-01 00:00:05     1     0
2020-01-01 00:00:06     1     1
2020-01-01 00:00:07     0     1
2020-01-01 00:00:08     0     0
2020-01-01 00:00:09     1     1
2020-01-01 00:00:10     1     0
2020-01-01 00:00:11     0     1
2020-01-01 00:00:12     1     0
2020-01-01 00:00:13     0     1
2020-01-01 00:00:14     0     1
2020-01-01 00:00:15     0     1
2020-01-01 00:00:16     1     1
2020-01-01 00:00:17     0     0
2020-01-01 00:00:18     0     1
2020-01-01 00:00:19     1     0

A solution to this example set (above) would look like this:

print(df):
                         tag1  tag2  Continuous_out_1  Continuous_out_2
2020-01-01 00:00:00     0     0                 0                 0
2020-01-01 00:00:01     1     0                 1                 0
2020-01-01 00:00:02     1     0                 1                 0
2020-01-01 00:00:03     1     1                 1                 0
2020-01-01 00:00:04     1     0                 1                 0
2020-01-01 00:00:05     1     0                 1                 0
2020-01-01 00:00:06     1     1                 1                 0
2020-01-01 00:00:07     0     1                 0                 0
2020-01-01 00:00:08     0     0                 0                 0
2020-01-01 00:00:09     1     1                 0                 0
2020-01-01 00:00:10     1     0                 0                 0
2020-01-01 00:00:11     0     1                 0                 0
2020-01-01 00:00:12     1     0                 0                 0
2020-01-01 00:00:13     0     1                 0                 1
2020-01-01 00:00:14     0     1                 0                 1
2020-01-01 00:00:15     0     1                 0                 1
2020-01-01 00:00:16     1     1                 0                 1
2020-01-01 00:00:17     0     0                 0                 0
2020-01-01 00:00:18     0     1                 0                 0
2020-01-01 00:00:19     1     0                 0                 0

Upvotes: 1

Views: 196

Answers (2)

RichieV
RichieV

Reputation: 5183

You can identify the regions of contiguous True/False and check if they are greater than your cutoff.

for colname, series in df.items():
    new = f'Continuous_{colname}'
    df[new] = series.diff().ne(0).cumsum() # label contiguous regions
    df[new] = series.groupby(df[new]).transform('size') # get size of region
    df[new] = df[new].gt(3) * series # mark with cutoff

Output

                     tag1  tag2  Continuous_tag1  Continuous_tag2
index
2020-01-01 00:00:00     0     0                0                0
2020-01-01 00:00:01     1     0                1                0
2020-01-01 00:00:02     1     0                1                0
2020-01-01 00:00:03     1     1                1                0
2020-01-01 00:00:04     1     0                1                0
2020-01-01 00:00:05     1     0                1                0
2020-01-01 00:00:06     1     1                1                0
2020-01-01 00:00:07     0     1                0                0
2020-01-01 00:00:08     0     0                0                0
2020-01-01 00:00:09     1     1                0                0
2020-01-01 00:00:10     1     0                0                0
2020-01-01 00:00:11     0     1                0                0
2020-01-01 00:00:12     1     0                0                0
2020-01-01 00:00:13     0     1                0                1
2020-01-01 00:00:14     0     1                0                1
2020-01-01 00:00:15     0     1                0                1
2020-01-01 00:00:16     1     1                0                1
2020-01-01 00:00:17     0     0                0                0
2020-01-01 00:00:18     0     1                0                0
2020-01-01 00:00:19     1     0                0                0

Upvotes: 2

anon01
anon01

Reputation: 11171

You can do this as:

  • create a series that distinguishes each streak (group)
  • assign bool to groups with more than three rows

code

# ok to loop over a few columns, still very performant
for col in ["tag1", "tag2"]:
    col_no = col[-1]
    df[f"group_{col}"] = np.cumsum(df[col].shift(1) != df[col])
    df[f"{col}_counts"] = df.groupby(f"group_{col}").tag1.transform("count") > 3
    df[f"Continuous_out_{col_no}"] = df[f"{col}_counts"].astype(int)
    df = df.drop(columns=[f"group_{col}", f"{col}_counts"])
    

output

                     tag1  tag2  Continuous_out_1  Continuous_out_2
2020-01-01 00:00:00     0     0                 0                 0
           00:00:01     1     0                 1                 0
           00:00:02     1     0                 1                 0
           00:00:03     1     1                 1                 0
           00:00:04     1     0                 1                 0
           00:00:05     1     0                 1                 0
           00:00:06     1     1                 1                 0
           00:00:07     0     1                 0                 0
           00:00:08     0     0                 0                 0
           00:00:09     1     1                 0                 0
           00:00:10     1     0                 0                 0
           00:00:11     0     1                 0                 0
           00:00:12     1     0                 0                 0
           00:00:13     0     1                 0                 1
           00:00:14     0     1                 0                 1
           00:00:15     0     1                 0                 1
           00:00:16     1     1                 0                 1
           00:00:17     0     0                 0                 0
           00:00:18     0     1                 0                 0
           00:00:19     1     0                 0                 0

Upvotes: 3

Related Questions