meliksahturker
meliksahturker

Reputation: 1504

How can I count the number of consecutive TRUEs in a DataFrame?

I have a dataset made of True and False.

Sample Table:
       A      B      C
0  False   True  False
1  False  False  False
2   True   True  False
3   True   True   True
4  False   True  False
5   True   True   True
6   True  False  False
7   True  False   True
8  False   True   True
9   True  False  False

I want to count the number of consecutive True values for every column, and if there's more than one consecutive True series, I want to get the max of it.

For the table above, I would get:

length = [3, 4, 2]

I found similar threads but none resolved my problem.

Since I do and will have many more columns(products), I need to do this regardless of the column name, for the whole table and get an array as the result.

And if possible, I'd like to learn the index of the first true of the longest sequence aka where this longest true series starts, so the result would be for this one:

index = [5, 2, 7]

Upvotes: 16

Views: 8973

Answers (2)

Divakar
Divakar

Reputation: 221614

We would basically leverage two philosophies - Catching shifts on compared array and Offsetting each column results so that we could vectorize it.

So, with that intention set, here's one way to achieve the desired results -

def maxisland_start_len_mask(a, fillna_index = -1, fillna_len = 0):
    # a is a boolean array

    pad = np.zeros(a.shape[1],dtype=bool)
    mask = np.vstack((pad, a, pad))

    mask_step = mask[1:] != mask[:-1]
    idx = np.flatnonzero(mask_step.T)
    island_starts = idx[::2]
    island_lens = idx[1::2] - idx[::2]
    n_islands_percol = mask_step.sum(0)//2

    bins = np.repeat(np.arange(a.shape[1]),n_islands_percol)
    scale = island_lens.max()+1

    scaled_idx = np.argsort(scale*bins + island_lens)
    grp_shift_idx = np.r_[0,n_islands_percol.cumsum()]
    max_island_starts = island_starts[scaled_idx[grp_shift_idx[1:]-1]]

    max_island_percol_start = max_island_starts%(a.shape[0]+1)

    valid = n_islands_percol!=0
    cut_idx = grp_shift_idx[:-1][valid]
    max_island_percol_len = np.maximum.reduceat(island_lens, cut_idx)

    out_len = np.full(a.shape[1], fillna_len, dtype=int)
    out_len[valid] = max_island_percol_len
    out_index = np.where(valid,max_island_percol_start,fillna_index)
    return out_index, out_len

Sample run -

# Generic case to handle all 0s columns
In [112]: a
Out[112]: 
array([[False, False, False],
       [False, False, False],
       [ True, False, False],
       [ True, False,  True],
       [False, False, False],
       [ True, False,  True],
       [ True, False, False],
       [ True, False,  True],
       [False, False,  True],
       [ True, False, False]])

In [117]: starts,lens = maxisland_start_len_mask(a, fillna_index=-1, fillna_len=0)

In [118]: starts
Out[118]: array([ 5, -1,  7])

In [119]: lens
Out[119]: array([3, 0, 2])

Upvotes: 7

jezrael
jezrael

Reputation: 863156

Solution should be simplify, if always at least one True per column:

b = df.cumsum()
c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)

print (c)
   A  B  C
0  0  1  0
1  0  0  0
2  1  1  0
3  2  2  1
4  0  3  0
5  1  4  1
6  2  0  0
7  3  0  1
8  0  1  2
9  1  0  0

#get maximal value of all columns
length = c.max().tolist()
print (length)
[3, 4, 2]

#get indexes by maximal value, subtract length and add 1 
index = c.idxmax().sub(length).add(1).tolist()
print (index)
[5, 2, 7]

Detail:

print (pd.concat([b,
                  b.mask(df), 
                  b.mask(df).ffill(), 
                  b.mask(df).ffill().fillna(0),
                  b.sub(b.mask(df).ffill().fillna(0)).astype(int)
                  ], axis=1, 
                  keys=('cumsum', 'mask', 'ffill', 'fillna','sub')))

  cumsum       mask           ffill           fillna           sub      
       A  B  C    A    B    C     A    B    C      A    B    C   A  B  C
0      0  1  0  0.0  NaN  0.0   0.0  NaN  0.0    0.0  0.0  0.0   0  1  0
1      0  1  0  0.0  1.0  0.0   0.0  1.0  0.0    0.0  1.0  0.0   0  0  0
2      1  2  0  NaN  NaN  0.0   0.0  1.0  0.0    0.0  1.0  0.0   1  1  0
3      2  3  1  NaN  NaN  NaN   0.0  1.0  0.0    0.0  1.0  0.0   2  2  1
4      2  4  1  2.0  NaN  1.0   2.0  1.0  1.0    2.0  1.0  1.0   0  3  0
5      3  5  2  NaN  NaN  NaN   2.0  1.0  1.0    2.0  1.0  1.0   1  4  1
6      4  5  2  NaN  5.0  2.0   2.0  5.0  2.0    2.0  5.0  2.0   2  0  0
7      5  5  3  NaN  5.0  NaN   2.0  5.0  2.0    2.0  5.0  2.0   3  0  1
8      5  6  4  5.0  NaN  NaN   5.0  5.0  2.0    5.0  5.0  2.0   0  1  2
9      6  6  4  NaN  6.0  4.0   5.0  6.0  4.0    5.0  6.0  4.0   1  0  0

EDIT:

General solution working with only False columns - add numpy.where with boolean mask created by DataFrame.any:

print (df)
       A      B      C
0  False   True  False
1  False  False  False
2   True   True  False
3   True   True  False
4  False   True  False
5   True   True  False
6   True  False  False
7   True  False  False
8  False   True  False
9   True  False  False

b = df.cumsum()
c = b.sub(b.mask(df).ffill().fillna(0)).astype(int)

mask = df.any()
length = np.where(mask, c.max(), -1).tolist()
print (length)
[3, 4, -1]

index =  np.where(mask, c.idxmax().sub(c.max()).add(1), 0).tolist()
print (index)
[5, 2, 0]

Upvotes: 21

Related Questions