Reputation: 6197
Say I have a boolean array
a2= np.array([False, False, True, False, False, True, True, True, False, False])
I want an array which contains the number of elements of each group of True values
Desired result:
np.array([1, 3])
Current solution:
sums = []
current_sum = 0
prev = False
for boo in a2:
if boo:
current_sum+=1
prev = True
if prev and not boo:
sums.append(current_sum)
current_sum = 0
if not boo:
prev = False
np.array(sums)
May not be the most computationally efficient. Seems like np.cumsum
could be used in a creative manner but I am not able to think of a solution.
Upvotes: 2
Views: 139
Reputation:
Another way using np.where
+ np.diff
to identify the split locations:
out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]
np.split
is slow, so we can replace it with zip
in a list comp and walk the over an array of indices. Also, instead of sum
, we could index the array and use len
:
idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
out = [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]
Output:
[1, 3]
Performance comparison:
import perfplot
import numpy as np
import itertools
import random
def diff_where_split_sum(a2):
return [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int), prepend=0)==1)[0])[1:]]
def flatnonzero_split_if_sum(a2):
return [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]
def groupby_if_sum(a2):
return [sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key]
def diff_where_slice_index_len(a2):
idx = np.where(np.diff(a2.astype(int), prepend=0)==1)[0]
return [len(a2[i:j][a2[i:j]]) for i,j in zip(idx, idx[1:])] + [len(a2[idx[-1]:][a2[idx[-1]:]])]
perfplot.show(
setup=lambda n: np.array(random.choices([True, False], k=10) * n),
kernels=[
lambda arr: diff_where_split_sum(arr),
lambda arr: flatnonzero_split_if_sum(arr),
lambda arr: groupby_if_sum(arr),
lambda arr: diff_where_slice_len(arr)
],
labels=['diff_where_split_sum', 'flatnonzero_split_if_sum',
'groupby_if_sum', 'diff_where_slice_index_len'],
n_range=[2 ** k for k in range(20)],
equality_check=np.allclose,
xlabel='~len(arr)'
)
Upvotes: 2
Reputation: 492
Another solution with itertools
import numpy as np
import itertools
a2= np.array([False, False, True, False, False, True, True, True, False, False])
foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]
print(foo)
output
[1, 3]
Edit:
Since you asked for performance too, I wrote this test code to check the performances of the three codes proposed so far:
#! /usr/bin/env python3
#-*- coding: utf-8 -*-
import itertools
import numpy as np
import timeit
a2 = np.random.choice([True, False], 1000000)
start_time = timeit.default_timer()
out = [ar.sum() for ar in np.split(a2, np.where(np.diff(a2.astype(int))==1)[0]+1)[1:]]
print(timeit.default_timer() - start_time)
start_time = timeit.default_timer()
foo = [ sum( 1 for _ in group ) for key, group in itertools.groupby( a2 ) if key ]
print(timeit.default_timer() - start_time)
start_time = timeit.default_timer()
l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]
print(timeit.default_timer() - start_time)
print(out == foo == l)
Outputs are always like:
$ python3 test.py
1.702160262999314
2.1189031369995064
4.760941952999929
True
So, the best solution is the one proposed by enke , followed by mine and then the one proposed by richardec
Upvotes: 1
Reputation:
You could use list comprehension with np.split
+ np.flatnonzero
:
l = [l.sum() for l in np.split(a2, np.flatnonzero(~a2)) if l.sum() > 0]
Output:
>>> l
[1, 3]
Upvotes: 1