Reputation: 149
I have a boolean, time-based data set, as in the example below. I want to highlight continuous sequences of more than three 1's in the data set, and capture this in a new column called [Continuous_out_x]. Is there an efficient operation to do this?
I generated test data in this way:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    zip(np.random.randint(2, size=20), np.random.randint(2, size=20)),
    columns=['tag1', 'tag2'],
    index=pd.date_range('2020-01-01', periods=20, freq='s'),
)
The output I got was the following:
print (df):
tag1 tag2
2020-01-01 00:00:00 0 0
2020-01-01 00:00:01 1 0
2020-01-01 00:00:02 1 0
2020-01-01 00:00:03 1 1
2020-01-01 00:00:04 1 0
2020-01-01 00:00:05 1 0
2020-01-01 00:00:06 1 1
2020-01-01 00:00:07 0 1
2020-01-01 00:00:08 0 0
2020-01-01 00:00:09 1 1
2020-01-01 00:00:10 1 0
2020-01-01 00:00:11 0 1
2020-01-01 00:00:12 1 0
2020-01-01 00:00:13 0 1
2020-01-01 00:00:14 0 1
2020-01-01 00:00:15 0 1
2020-01-01 00:00:16 1 1
2020-01-01 00:00:17 0 0
2020-01-01 00:00:18 0 1
2020-01-01 00:00:19 1 0
A solution to this example set (above) would look like this:
print(df):
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
Upvotes: 1
Views: 196
Reputation: 5183
You can label the regions of contiguous equal values, then check whether each region's length exceeds your cutoff.
for colname, series in df.items():
    new = f'Continuous_{colname}'
    df[new] = series.diff().ne(0).cumsum()               # label contiguous regions
    df[new] = series.groupby(df[new]).transform('size')  # get size of each region
    df[new] = df[new].gt(3) * series                     # keep only runs of 1's longer than 3
Output
tag1 tag2 Continuous_tag1 Continuous_tag2
index
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
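The same idea can be wrapped in a small standalone helper. This is a sketch: the function name `flag_long_runs` and the `min_length` parameter are my own naming, not from the original code ("more than three" 1's means a run length of at least 4):

```python
import pandas as pd

def flag_long_runs(series: pd.Series, min_length: int = 4) -> pd.Series:
    """Return 1 where `series` sits inside a run of 1's of at least `min_length`."""
    run_id = series.diff().ne(0).cumsum()                # label each contiguous run
    run_size = series.groupby(run_id).transform('size')  # length of the run each row is in
    # mask with the series itself so long runs of 0's are not flagged
    return ((run_size >= min_length) & series.eq(1)).astype(int)

s = pd.Series([0, 1, 1, 1, 1, 0, 1, 1, 0])
print(flag_long_runs(s).tolist())  # → [0, 1, 1, 1, 1, 0, 0, 0, 0]
```

Applying it per column (`df[f'Continuous_{col}'] = flag_long_runs(df[col])`) reproduces the output above.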
Upvotes: 2
Reputation: 11171
You can do this as:
code
# ok to loop over a few columns, still very performant
for col in ["tag1", "tag2"]:
    col_no = col[-1]
    # new run label each time the value changes
    df[f"group_{col}"] = np.cumsum(df[col].shift(1) != df[col])
    # size each run of the current column (not a hard-coded tag1), and mask with
    # the column itself so long runs of 0's are not flagged
    df[f"{col}_counts"] = (
        df.groupby(f"group_{col}")[col].transform("count") > 3
    ) & df[col].astype(bool)
    df[f"Continuous_out_{col_no}"] = df[f"{col}_counts"].astype(int)
    df = df.drop(columns=[f"group_{col}", f"{col}_counts"])
output
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
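To see why the `group_{col}` labels isolate runs: every time the value changes, comparing against the shifted series yields True, so the cumulative sum ticks up and each contiguous run gets its own label. A minimal illustration on a hand-made series:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, 0, 1])
groups = np.cumsum(s.shift(1) != s)  # new label every time the value changes
print(groups.tolist())  # → [1, 1, 2, 2, 2, 3, 4]
```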
Upvotes: 3