Expanding numpy based code that detect the frequency of the consecutive number to work on multidimensional array instead of 1D array

Question

This stackoverflow answer provides a simple way (below) to find the frequency and indices of consecutive repeated numbers. This solution is much faster than loop-based code (see the original post above).

boundaries = np.where(np.diff(aa) != 0)[0] + 1 #group boundaries

get_idx_freqs = lambda i, d: (np.concatenate(([0], i))[d >= 2], d[d >= 2])
idx, freqs = get_idx_freqs(boundaries, np.diff(np.r_[0, boundaries, len(aa)]))

and the output

# aa=np.array([1,2,2,3,3,3,4,4,4,4,5,5,5,5,5])
(array([ 1,  3,  6, 10]), array([2, 3, 4, 5]))

# aa=np.array([1,1,1,np.nan,np.nan,1,1,np.nan])
(array([0, 5]), array([3, 2]))

Wondering if this solution could be expanded to work on multidimensional array instead of the slow traditional loop, as the following:

#%%
def get_frequency_of_events_fast(aa):
    boundaries = np.where(np.diff(aa) != 0)[0] + 1 #group boundaries

    get_idx_freqs = lambda i, d: (np.concatenate(([0], i))[d >= 2], d[d >= 2])

    idx, freqs = get_idx_freqs(boundaries, np.diff(np.r_[0, boundaries, len(aa)]))
    return idx,freqs

tmp2_file=np.load('tmp2.npz')
tmp2 = tmp2_file['arr_0']

idx_all=[]
frq_all=[]
for i in np.arange(tmp2.shape[1]):
    for j in np.arange(tmp2.shape[2]):
        print("==>> i, j "+str(i)+' '+str(j))
        idx,freq=get_frequency_of_events_fast(tmp2[:,i,j])
        idx_all.append(idx)
        frq_all.append(freq)
        #if j == 69:
        #    break
        print(idx)
        print(freq)
    #if i == 0:
    #    break

I appended the indices and frequencies to the one dimensional list and also I was wondering if there is a way to append to two dimensional array.

The file could be downloaded from box.com. Here is a sample output

==>> i, j 0 61
[ 27  73 226 250 627 754 760 798 825 891 906]
[ 12   8   5  17 109   5  12  26  30  12   3]
==>> i, j 0 62
[ 29  75 226 250 258 627 754 761 800 889]
[ 11   7   5   6   6 114   5  14  57  21]
==>> i, j 0 63
[ 33 226 622 680 754 762 801 888]
[ 9  5 56 63  5 21 58 26]
==>> i, j 0 64
[ 33 226 615 622 693 753 762 801 889 972 993]
[12  5  4 68 54  6 21 60 26  3  2]
==>> i, j 0 65
[ 39 615 621 693 801 891 972 987 992]
[ 7  3 70 90 61 24  3  2  7]
==>> i, j 0 66
[ 39 617 657 801 891 970 987]
[  7  34 132  63  30   5  13]
==>> i, j 0 67
[ 39  88 621 633 657 680 801 804 891 969 986]
[ 11   4   6   2   6 110   2  63  30   6  14]
==>> i, j 0 68
[ 39  87 681 715 740 766 807 873 891 969 984]
[12  6 33  3 22 24 60  3 31  6 16]

PaulS · Accepted Answer

A possible solution (on my computer, it runs instantaneously):

# data = np.load('tmp2.npz')
# tmp2 = data['arr_0']

def get_freqs(aa):
    boundaries = np.where(np.diff(aa) != 0)[0] + 1
    edges = np.r_[0, boundaries, len(aa)]
    group_lengths = np.diff(edges)
    valid = group_lengths >= 2
    idx = np.concatenate(([0], boundaries))[valid]
    return idx, group_lengths[valid]

out = {
    (i, j): get_freqs(tmp2[:, i, j])
    for i, j in np.ndindex(tmp2.shape[1], tmp2.shape[2])
}

The function computes the starting indices and lengths of consecutive groups in a one-dimensional array where the value remains the same, ignoring groups with fewer than two elements. It does this by first identifying change points using np.diff, then constructing group edges with np.r_ and calculating group lengths with np.diff, and finally filtering groups based on a minimum length criterion. The dictionary comprehension applies this function to every (i, j) slice (i.e., along the first dimension) of the 3-D array tmp2, storing the results in a dictionary keyed by the (i, j) indices.

Since the OP has a large number of cores, numba + parallel processing can very much speed up the calculations:

import numpy as np
from numba import njit, prange
from numba.typed import List

@njit(parallel=True)
def process_all(tmp2):
    T, M, N = tmp2.shape

    out = List()
    for _ in range(M * N):
        out.append((np.empty(0, np.int64), np.empty(0, np.int64)))
        
    for i in prange(M):
        for j in range(N):
            aa = tmp2[:, i, j]
            n = aa.shape[0]

            idx_arr = np.empty(n, np.int64)
            len_arr = np.empty(n, np.int64)
            count = 0
            start = 0
            for k in range(1, n):
                if aa[k] != aa[k - 1]:
                    group_len = k - start
                    if group_len >= 2:
                        idx_arr[count] = start
                        len_arr[count] = group_len
                        count += 1
                    start = k

            group_len = n - start
            if group_len >= 2:
                idx_arr[count] = start
                len_arr[count] = group_len
                count += 1

            out[i * N + j] = (idx_arr[:count].copy(), len_arr[:count].copy())
    return out

data = np.load('/content/tmp2.npz')
tmp2 = data['arr_0']

flat_results = process_all(tmp2)

M, N = tmp2.shape[1], tmp2.shape[2]
results = {}
idx = 0
for i in range(M):
    for j in range(N):
        results[(i, j)] = flat_results[idx]
        idx += 1

Expanding numpy based code that detect the frequency of the consecutive number to work on multidimensional array instead of 1D array

Answers (1)

Related Questions