Reputation: 391
I have the following dataframe:
   A  B  C
0  1  1  1
1  0  1  0
2  1  1  1
3  1  0  1
4  1  1  0
5  1  1  0
6  0  1  1
7  0  1  0
For each column, I want to know the start and end index of every run of 3 or more consecutive 1s. Desired outcome:
Column From To
A 2 5
B 0 2
B 4 7
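For anyone who wants to reproduce this, the sample frame shown above can be rebuilt roughly as follows. The values are copied from the table; the default integer index is assumed.

```python
import pandas as pd

# Values copied from the table in the question; default 0..7 index assumed.
df = pd.DataFrame({
    "A": [1, 0, 1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 1, 1, 1],
    "C": [1, 0, 1, 1, 0, 0, 1, 0],
})
print(df)
```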
First, I zero out the 1s that are not part of a run of 3 or more consecutive values:
filtered_df = df.copy().apply(filter, threshold=3)
where
def filter(col, threshold=3):
    # label runs of equal values and flag rows in runs shorter than threshold
    mask = col.groupby((col != col.shift()).cumsum()).transform('count').lt(threshold)
    # only the 1s in those short runs are reset to 0
    mask &= col.eq(1)
    col.update(col.loc[mask].replace(1,0))
    return col
filtered_df now looks like this:
   A  B  C
0  0  1  0
1  0  1  0
2  1  1  0
3  1  0  0
4  1  1  0
5  1  1  0
6  0  1  0
7  0  1  0
If the dataframe had only one column of zeros and ones, the result could be obtained as in "How to use pandas to find consecutive same data in time series". However, I am struggling to do something similar for multiple columns at once.
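For context, the single-column idea can be sketched as below. This is only an illustration of the shift/cumsum run-labelling trick, not the code from the linked question; the helper name runs_of_ones and the variable b are made up for this example.

```python
import pandas as pd

def runs_of_ones(s, threshold=3):
    """Return (start, end) index labels of runs of 1s with at least `threshold` rows.

    Hypothetical helper, written for illustration only.
    """
    # A new run id starts wherever the value differs from the previous row.
    run_id = (s != s.shift()).cumsum()
    ones = s.eq(1)
    # Index labels of the rows that hold a 1, grouped by their run id.
    idx = s.index.to_series()[ones]
    spans = idx.groupby(run_id[ones]).agg(["first", "last", "count"])
    # Drop runs that are too short.
    spans = spans[spans["count"] >= threshold]
    return list(zip(spans["first"], spans["last"]))

# Column B from the sample frame.
b = pd.Series([1, 1, 1, 0, 1, 1, 1, 1], name="B")
print(runs_of_ones(b))
```

For column B this yields the two runs 0 to 2 and 4 to 7; the open question is how to do the same for every column in one go.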
Upvotes: 3
Views: 1372
Reputation: 863166
Use DataFrame.pipe to apply a function to the whole DataFrame.
In the first solution, get the first and last index of each run of consecutive 1s per column, append each result to a list, and finally join them with concat:
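import pandas as pd

def f(df, threshold=3):
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        # run ids for consecutive equal values, kept only where the value is 1
        g = (df[col] != df[col].shift()).cumsum()[m]
        # keep runs with at least `threshold` rows
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        # first and last index label of each remaining run
        output = filt.groupby(col)['index'].agg(['first','last'])
        output.insert(0, 'col', col)
        out.append(output)
    return pd.concat(out, ignore_index=True)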
Or first reshape with unstack and then apply the same logic to the long format:
def f(df, threshold=3):
    # long format: one row per (column, index) pair
    df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    # run ids; the per-column shift starts a new run at each column boundary
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
            .agg(['first','last'])
            .reset_index(level=1, drop=True)
            .reset_index())
filtered_df = df.pipe(f, threshold=3)
print(filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
filtered_df = df.pipe(f, threshold=2)
print(filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
3   C      2     3
Upvotes: 2
Reputation: 36299
You can use rolling to create a window over the data frame. Then you can apply all your conditions and shift the window back to its start location:
length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
mask = mask.shift(1 - length)
print(mask)
which prints:
       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3   True  False  False
4  False   True  False
5  False   True  False
6    NaN    NaN    NaN
7    NaN    NaN    NaN
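Note that this mask marks where each all-ones window starts, not the run boundaries themselves. One possible way to turn it into From/To pairs is sketched below. This goes beyond the answer as posted: the names starts, rows and result are made up for illustration, fill_value=False is used so the shifted mask stays boolean instead of picking up NaN, and the sample frame is rebuilt from the table in the question.

```python
import pandas as pd

# Sample frame rebuilt from the question's table.
df = pd.DataFrame({
    "A": [1, 0, 1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 1, 1, 1],
    "C": [1, 0, 1, 1, 0, 0, 1, 0],
})

length = 3
window = df.rolling(length)
starts = (window.min() == 1) & (window.max() == 1)
# Move each flag from the window's last row to its first row.
starts = starts.shift(1 - length, fill_value=False).astype(bool)

rows = []
for col in starts.columns:
    s = starts[col]
    # Label blocks of consecutive equal flags, keep only the True rows.
    block = s.ne(s.shift(fill_value=False)).cumsum()[s]
    pos = pd.Series(range(len(s)), index=s.index)[s]
    for _, p in pos.groupby(block):
        first = p.iloc[0]
        # The last window start still covers `length` rows.
        last = p.iloc[-1] + length - 1
        rows.append((col, df.index[first], df.index[last]))

result = pd.DataFrame(rows, columns=["Column", "From", "To"])
print(result)
```

Adjacent window starts are merged into one run, so a run longer than the window length is reported once rather than once per window.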
Upvotes: 1