Reputation: 743
This stackoverflow answer provides a simple way (below) to find the frequency and indices of consecutive repeated numbers. This solution is much faster than loop-based code (see the original post above).
boundaries = np.where(np.diff(aa) != 0)[0] + 1 #group boundaries
get_idx_freqs = lambda i, d: (np.concatenate(([0], i))[d >= 2], d[d >= 2])
idx, freqs = get_idx_freqs(boundaries, np.diff(np.r_[0, boundaries, len(aa)]))
and the output
# aa=np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
(array([ 1, 3, 6, 10]), array([2, 3, 4, 5]))
# aa=np.array([1,1,1,np.nan,np.nan,1,1,np.nan])
(array([0, 5]), array([3, 2]))
Wondering if this solution could be expanded to work on multidimensional array instead of the slow traditional loop, as the following:
#%%
def get_frequency_of_events_fast(aa):
boundaries = np.where(np.diff(aa) != 0)[0] + 1 #group boundaries
get_idx_freqs = lambda i, d: (np.concatenate(([0], i))[d >= 2], d[d >= 2])
idx, freqs = get_idx_freqs(boundaries, np.diff(np.r_[0, boundaries, len(aa)]))
return idx,freqs
tmp2_file=np.load('tmp2.npz')
tmp2 = tmp2_file['arr_0']
idx_all=[]
frq_all=[]
for i in np.arange(tmp2.shape[1]):
for j in np.arange(tmp2.shape[2]):
print("==>> i, j "+str(i)+' '+str(j))
idx,freq=get_frequency_of_events_fast(tmp2[:,i,j])
idx_all.append(idx)
frq_all.append(freq)
#if j == 69:
# break
print(idx)
print(freq)
#if i == 0:
# break
I appended the indices and frequencies to the one dimensional list and also I was wondering if there is a way to append to two dimensional array.
The file could be downloaded from box.com. Here is a sample output
==>> i, j 0 61
[ 27 73 226 250 627 754 760 798 825 891 906]
[ 12 8 5 17 109 5 12 26 30 12 3]
==>> i, j 0 62
[ 29 75 226 250 258 627 754 761 800 889]
[ 11 7 5 6 6 114 5 14 57 21]
==>> i, j 0 63
[ 33 226 622 680 754 762 801 888]
[ 9 5 56 63 5 21 58 26]
==>> i, j 0 64
[ 33 226 615 622 693 753 762 801 889 972 993]
[12 5 4 68 54 6 21 60 26 3 2]
==>> i, j 0 65
[ 39 615 621 693 801 891 972 987 992]
[ 7 3 70 90 61 24 3 2 7]
==>> i, j 0 66
[ 39 617 657 801 891 970 987]
[ 7 34 132 63 30 5 13]
==>> i, j 0 67
[ 39 88 621 633 657 680 801 804 891 969 986]
[ 11 4 6 2 6 110 2 63 30 6 14]
==>> i, j 0 68
[ 39 87 681 715 740 766 807 873 891 969 984]
[12 6 33 3 22 24 60 3 31 6 16]
Upvotes: 2
Views: 130
Reputation: 25528
A possible solution (on my computer, it runs instantaneously):
# data = np.load('tmp2.npz')
# tmp2 = data['arr_0']
def get_freqs(aa):
boundaries = np.where(np.diff(aa) != 0)[0] + 1
edges = np.r_[0, boundaries, len(aa)]
group_lengths = np.diff(edges)
valid = group_lengths >= 2
idx = np.concatenate(([0], boundaries))[valid]
return idx, group_lengths[valid]
out = {
(i, j): get_freqs(tmp2[:, i, j])
for i, j in np.ndindex(tmp2.shape[1], tmp2.shape[2])
}
The function computes the starting indices and lengths of consecutive groups in a one-dimensional array where the value remains the same, ignoring groups with fewer than two elements. It does this by first identifying change points using np.diff
, then constructing group edges with np.r_
and calculating group lengths with np.diff
, and finally filtering groups based on a minimum length criterion. The dictionary comprehension applies this function to every (i, j)
slice (i.e., along the first dimension) of the 3-D array tmp2
, storing the results in a dictionary keyed by the (i, j)
indices.
Since the OP has a large number of cores, numba + parallel processing
can very much speed up the calculations:
import numpy as np
from numba import njit, prange
from numba.typed import List
@njit(parallel=True)
def process_all(tmp2):
T, M, N = tmp2.shape
out = List()
for _ in range(M * N):
out.append((np.empty(0, np.int64), np.empty(0, np.int64)))
for i in prange(M):
for j in range(N):
aa = tmp2[:, i, j]
n = aa.shape[0]
idx_arr = np.empty(n, np.int64)
len_arr = np.empty(n, np.int64)
count = 0
start = 0
for k in range(1, n):
if aa[k] != aa[k - 1]:
group_len = k - start
if group_len >= 2:
idx_arr[count] = start
len_arr[count] = group_len
count += 1
start = k
group_len = n - start
if group_len >= 2:
idx_arr[count] = start
len_arr[count] = group_len
count += 1
out[i * N + j] = (idx_arr[:count].copy(), len_arr[:count].copy())
return out
data = np.load('/content/tmp2.npz')
tmp2 = data['arr_0']
flat_results = process_all(tmp2)
M, N = tmp2.shape[1], tmp2.shape[2]
results = {}
idx = 0
for i in range(M):
for j in range(N):
results[(i, j)] = flat_results[idx]
idx += 1
Upvotes: 2