SantoshGupta7
SantoshGupta7

Reputation: 6197

Most computationally efficient way to count consecutive repeating values

Say I have a boolean array

a2= np.array([False, False, True, False, False, True, True, True, False, False])

I want an array which contains the number of elements of each group of True values

Desired result:

np.array([1, 3])

Current solution:

sums = []
current_sum = 0
prev = False
for boo in a2:
    if boo:
        current_sum+=1
        prev = True
    if prev and not boo:
        sums.append(current_sum)
        current_sum = 0
    if not boo:
        prev = False
np.array(sums)

May not be the most computationally efficient. Seems like np.cumsum could be used in a creative manner but I am not able to think of a solution.

Upvotes: 2

Views: 139

Answers (3)

user7864386
user7864386

Reputation:

Another way using np.where + np.diff to identify the split locations:

out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]

np.split is slow, so we can replace it with zip in a list comp and walk the over an array of indices. Also, instead of sum, we could index the array and use len:

idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
out = [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]

Output:

[1, 3]

Performance comparison:

import perfplot
import numpy as np
import itertools
import random

def diff_where_split_sum(a2):
    return [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]

def flatnonzero_split_if_sum(a2):
    return [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]

def groupby_if_sum(a2):
    return [sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key]

def diff_where_slice_index_len(a2):
    idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
    return [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]

perfplot.show(
    setup=lambda n: np.array(random.choices([True, False], k=10) * n),
    kernels=[
        lambda arr: diff_where_split_sum(arr),
        lambda arr: flatnonzero_split_if_sum(arr),
        lambda arr: groupby_if_sum(arr),
        lambda arr: diff_where_slice_len(arr)
    ],
    labels=['diff_where_split_sum', 'flatnonzero_split_if_sum', 
            'groupby_if_sum', 'diff_where_slice_index_len'],
    n_range=[2 ** k for k in range(20)],
    equality_check=np.allclose,  
    xlabel='~len(arr)'
)

enter image description here

Upvotes: 2

vaeVictis
vaeVictis

Reputation: 492

Another solution with itertools

import numpy as np
import itertools

a2= np.array([False, False, True, False, False, True, True, True, False, False])

foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]

print(foo)

output

[1, 3]

Edit:

Since you asked for performance too, I wrote this test code to check the performances of the three codes proposed so far:

#! /usr/bin/env python3
#-*- coding: utf-8 -*-

import itertools
import numpy as np
import timeit

a2 = np.random.choice([True, False], 1000000)


start_time = timeit.default_timer()
out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int))==1)[0]+1)[1:]]
print(timeit.default_timer() - start_time)

start_time = timeit.default_timer()
foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]
print(timeit.default_timer() - start_time)

start_time = timeit.default_timer()
l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]
print(timeit.default_timer() - start_time)

print(out == foo == l)

Outputs are always like:

$ python3 test.py 
1.702160262999314
2.1189031369995064
4.760941952999929
True

So, the best solution is the one proposed by enke , followed by mine and then the one proposed by richardec

Upvotes: 1

user17242583
user17242583

Reputation:

You could use list comprehension with np.split + np.flatnonzero:

l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]

Output:

>>> l
[1, 3]

Upvotes: 1

Related Questions