Hoekieee
Hoekieee

Reputation: 391

Find begin and end index of consecutive ones in pandas dataframe

I have the following dataframe:

     A    B    C
0    1    1    1
1    0    1    0
2    1    1    1
3    1    0    1
4    1    1    0
5    1    1    0 
6    0    1    1
7    0    1    0

of which I want to know the start and end index when the values are 1 for 3 or more consecutive values per column. Desired outcome:

Column    From    To    
     A       2     5
     B       1     3         
     B       4     7

first I filter out the ones that are not consecutive for 3 or more values

filtered_df = df.copy().apply(filter, threshold=3)

where

def filter(col, threshold=3):  
    mask = col.groupby((col != col.shift()).cumsum()).transform('count').lt(threshold)
    mask &= col.eq(1)
    col.update(col.loc[mask].replace(1,0))
    return col

filtered_df now look as:

     A    B    C
0    0    1    0
1    0    1    0
2    1    1    0
3    1    0    0
4    1    1    0
5    1    1    0 
6    0    1    0
7    0    1    0

If the dataframe would have only one column with zeros and ones the result could be achieved as in How to use pandas to find consecutive same data in time series. However, I am struggeling to do something similar for multiple columns at once.

Upvotes: 3

Views: 1372

Answers (2)

jezrael
jezrael

Reputation: 863166

Use DataFrame.pipe for apply function for all DataFrame.

In first solution get first and last value of consecutive 1 per each columns, add output to lists and last concat:

def f(df, threshold=3): 
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        g = (df[col] != df[col].shift()).cumsum()[m]
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        output = filt.groupby(col)['index'].agg(['first','last'])
        output.insert(0, 'col', col)
        out.append(output)

    return pd.concat(out, ignore_index=True)

Or first reshape by unstack and then apply solution:

def f(df, threshold=3):

    df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                    .agg(['first','last'])
                    .reset_index(level=1, drop=True)
                    .reset_index())


filtered_df = df.pipe(f, threshold=3)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
filtered_df = df.pipe(f, threshold=2)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
3   C      2     3

Upvotes: 2

a_guest
a_guest

Reputation: 36299

You can use rolling to create a window over the data frame. Then you can apply all your conditions and shift the window back to its start location:

length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
mask = mask.shift(1 - length)
print(mask)

which prints:

       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3   True  False  False
4  False   True  False
5  False   True  False
6    NaN    NaN    NaN
7    NaN    NaN    NaN

Upvotes: 1

Related Questions