Glxblt76

Reputation: 395

Clustering 2D numpy arrays into smaller 4D numpy arrays

Dear Stack Overflow users,

My Python script is facing performance issues because I have to crunch 2D tables with more than a billion elements, for potentially hundreds of input files. I have been replacing my nested for loops with numpy array manipulation calls, and in this process I found that numpy.take (which gathers elements according to a set of indices) and numpy.outer (which evaluates all possible products between the elements of two 1D arrays) are extraordinarily useful. These functions sped up my code by several hundred times where I could use them.
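For context, here is a minimal sketch of those two calls (the arrays are made-up examples, not data from my script):

import numpy as np

a = np.array([10.0, 20.0, 30.0])
b = np.array([0.5, 2.0])

# np.take gathers elements of `a` at the given indices.
np.take(a, [2, 0, 0])   # -> array([30., 10., 10.])

# np.outer forms all pairwise products between two 1D arrays.
np.outer(a, b)          # -> array([[ 5., 20.],
                        #           [10., 40.],
                        #           [15., 60.]])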

But there is still one place in my code where I have a problem: where I cluster my 2D array of, say, a billion elements into a 4D array with far fewer elements (a few thousand). Specifically, I have two lists of indices whose length equals the number of rows of the matrix (which is square).

The first list of indices is th_t, the second list is dm_t, and the matrix is p_contact. The 4D array of clustered elements is named rc_p. The clustering procedure is the following nested for loop:

import numpy as np

th_t = [1, 3, 2, 1, 1, 3, 3, 0, 1, 0, 2, 1]
dm_t = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
n_th = len(set(th_t))
n_dm = len(set(dm_t))
p_contact = [[0.0129, 0.0134, 0.0062, 0.0021, 0.0107, 0.0106, 0.0076, 0.0134, 0.0087, 0.0031, 0.0026, 0.0114],
             [0.0123, 0.0021, 0.0033, 0.0120, 0.0099, 0.0125, 0.0001, 0.0018, 0.0030, 0.0059, 0.0038, 0.0125],
             [0.0082, 0.0125, 0.0004, 0.0120, 0.0040, 0.0108, 0.0101, 0.0063, 0.0072, 0.0098, 0.0017, 0.0121],
             [0.0096, 0.0008, 0.0073, 0.0100, 0.0123, 0.0104, 0.0077, 0.0025, 0.0106, 0.0126, 0.0031, 0.0033],
             [0.0112, 0.0091, 0.0134, 0.0002, 0.0129, 0.0081, 0.0087, 0.0036, 0.0102, 0.0002, 0.0019, 0.0131],
             [0.0099, 0.0081, 0.0037, 0.0004, 0.0135, 0.0005, 0.0025, 0.0086, 0.0091, 0.0016, 0.0130, 0.0011],
             [0.0078, 0.0005, 0.0044, 0.0089, 0.0127, 0.0106, 0.0113, 0.0048, 0.0057, 0.0133, 0.0077, 0.0033],
             [0.0017, 0.0010, 0.0048, 0.0052, 0.0113, 0.0066, 0.0133, 0.0092, 0.0020, 0.0125, 0.0011, 0.0023],
             [0.0027, 0.0124, 0.0096, 0.0047, 0.0134, 0.0020, 0.0129, 0.0114, 0.0087, 0.0114, 0.0090, 0.0001],
             [0.0032, 0.0014, 0.0038, 0.0114, 0.0058, 0.0017, 0.0089, 0.0057, 0.0022, 0.0056, 0.0046, 0.0094],
             [0.0033, 0.0020, 0.0042, 0.0040, 0.0110, 0.0016, 0.0100, 0.0014, 0.0087, 0.0123, 0.0004, 0.0031],
             [0.0010, 0.0029, 0.0054, 0.0015, 0.0064, 0.0060, 0.0131, 0.0064, 0.0073, 0.0097, 0.0132, 0.0092]]
n_sg = len(p_contact)
rc_p = np.zeros((n_th, n_dm, n_th, n_dm)) 
for i in range(n_sg):  # n_sg can be about 40000
    for j in range(n_sg):
        rc_p[th_t[i]][dm_t[i]][th_t[j]][dm_t[j]] += p_contact[i][j]

I tried to use various numpy functions to avoid this nested for loop over a billion elements, and I ended up with the following procedure:

import numpy as np
from itertools import product

th_t = [1, 3, 2, 1, 1, 3, 3, 0, 1, 0, 2, 1]
dm_t = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0]
n_th = len(set(th_t))
n_dm = len(set(dm_t))
p_contact = np.array([[0.0129, 0.0134, 0.0062, 0.0021, 0.0107, 0.0106, 0.0076, 0.0134, 0.0087, 0.0031, 0.0026, 0.0114],
                      [0.0123, 0.0021, 0.0033, 0.0120, 0.0099, 0.0125, 0.0001, 0.0018, 0.0030, 0.0059, 0.0038, 0.0125],
                      [0.0082, 0.0125, 0.0004, 0.0120, 0.0040, 0.0108, 0.0101, 0.0063, 0.0072, 0.0098, 0.0017, 0.0121],
                      [0.0096, 0.0008, 0.0073, 0.0100, 0.0123, 0.0104, 0.0077, 0.0025, 0.0106, 0.0126, 0.0031, 0.0033],
                      [0.0112, 0.0091, 0.0134, 0.0002, 0.0129, 0.0081, 0.0087, 0.0036, 0.0102, 0.0002, 0.0019, 0.0131],
                      [0.0099, 0.0081, 0.0037, 0.0004, 0.0135, 0.0005, 0.0025, 0.0086, 0.0091, 0.0016, 0.0130, 0.0011],
                      [0.0078, 0.0005, 0.0044, 0.0089, 0.0127, 0.0106, 0.0113, 0.0048, 0.0057, 0.0133, 0.0077, 0.0033],
                      [0.0017, 0.0010, 0.0048, 0.0052, 0.0113, 0.0066, 0.0133, 0.0092, 0.0020, 0.0125, 0.0011, 0.0023],
                      [0.0027, 0.0124, 0.0096, 0.0047, 0.0134, 0.0020, 0.0129, 0.0114, 0.0087, 0.0114, 0.0090, 0.0001],
                      [0.0032, 0.0014, 0.0038, 0.0114, 0.0058, 0.0017, 0.0089, 0.0057, 0.0022, 0.0056, 0.0046, 0.0094],
                      [0.0033, 0.0020, 0.0042, 0.0040, 0.0110, 0.0016, 0.0100, 0.0014, 0.0087, 0.0123, 0.0004, 0.0031],
                      [0.0010, 0.0029, 0.0054, 0.0015, 0.0064, 0.0060, 0.0131, 0.0064, 0.0073, 0.0097, 0.0132, 0.0092]])

#prepare the flattened list of index pairs
th_t        = np.asarray(th_t)
dm_t        = np.asarray(dm_t)
thdm_stack  = np.stack((th_t, dm_t))
thdm_stack  = np.transpose(thdm_stack)
thdm_table  = np.asarray(list(product(thdm_stack, thdm_stack)))
p_contact_f = p_contact.flatten()

#calculate clustered probabilities for each contact type
rc_p                 = np.zeros((n_th, n_dm, n_th, n_dm)) 

for th1 in range(n_th):
    for dm1 in range(n_dm):
        for th2 in range(n_th):
            for dm2 in range(n_dm):
                to_find                       = np.zeros((2, 2))
                to_find[0][0]                 = th1
                to_find[0][1]                 = dm1
                to_find[1][0]                 = th2
                to_find[1][1]                 = dm2
                condition                     = np.isin(thdm_table, to_find)
                condition                     = np.all(condition, axis=(1, 2))
                to_add                        = np.extract(condition, p_contact_f)
                rc_p[th1][dm1][th2][dm2]      = np.sum(to_add)

This ends up being slower than the original procedure, not faster, probably because at each of the thousands of steps of the 4D loop I have to generate and reduce a billion-element boolean array (even though the 4D loop itself has thousands of times fewer iterations than the initial one).

So, does anyone have an idea of how I could replace this costly nested for loop and take maximum advantage of numpy's underlying C code to cluster this big 2D matrix into the much smaller 4D array?

Note that the individual elements in these arrays are probabilities. The total sum of all elements, both in the 2D array and in the 4D clustered array, is 1. By "clustering" I mean grouping probabilities into types: all cells of the 2D matrix that carry identical sets of indices get their probabilities added into a single element of the 4D clustered array.
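As a quick sanity check of that conservation property (a one-line sketch, assuming rc_p and p_contact are defined as above):

# clustering only regroups probabilities, so the total mass must be preserved
assert np.isclose(np.sum(rc_p), np.sum(p_contact))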

All the best!

Upvotes: 1

Views: 323

Answers (2)

Daniel F

Reputation: 14399

You're not actually iterating over four dimensions, you're iterating over two: i and j. You can np.ravel_multi_index your th_t and dm_t arrays together to reduce the problem to 2D, and reshape back to 4D at the end:

idx = np.ravel_multi_index((th_t, dm_t), (n_th, n_dm))
rc_p = np.zeros((n_th * n_dm, n_th * n_dm))
for i in range(idx.size):
    np.add.at(rc_p[idx[i]], idx, p_contact[i])
rc_p = rc_p.reshape(n_th, n_dm, n_th, n_dm)
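For reference, a fully vectorized variant of the same idea (a sketch, assuming p_contact is already a NumPy array) replaces the remaining Python loop with a single np.add.at call over a broadcast pair of index arrays:

idx = np.ravel_multi_index((th_t, dm_t), (n_th, n_dm))
rc_p = np.zeros((n_th * n_dm, n_th * n_dm))
# idx[:, None] and idx[None, :] broadcast to an (n_sg, n_sg) grid of index
# pairs; np.add.at accumulates correctly even when index pairs repeat.
np.add.at(rc_p, (idx[:, None], idx[None, :]), p_contact)
rc_p = rc_p.reshape(n_th, n_dm, n_th, n_dm)

np.add.at is unbuffered and therefore correct with repeated indices, but it is not especially fast; for very large n_sg the compiled loop below may still win.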

Or, if you can use numba, just wrap your initial loopy code in a @jit decorator, which will compile it to machine code:

from numba import jit

@jit
def foo(p_contact, th_t, dm_t, n_th, n_dm):
    n_sg = len(p_contact)
    rc_p = np.zeros((n_th, n_dm, n_th, n_dm)) 
    for i in range(n_sg): 
        for j in range(n_sg):
            rc_p[th_t[i]][dm_t[i]][th_t[j]][dm_t[j]] += p_contact[i][j]
    return rc_p
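A hypothetical usage example (the first call includes JIT compilation overhead; passing NumPy arrays rather than Python lists keeps numba in fast native mode):

rc_p = foo(np.asarray(p_contact), np.asarray(th_t),
           np.asarray(dm_t), n_th, n_dm)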

Upvotes: 1

Glxblt76

Reputation: 395

I want to emphasize the very useful function proposed by Daniel F, which I did not know about and which was key to fixing this issue:

numpy.ravel_multi_index

It transforms tuples of multi-dimensional indices into flat 1D indices. For example, for an array of shape (2, 9), the multi-index (1, 4) maps to the flat index 1*9 + 4 = 13 (one full row of 9 elements, plus 4). It is a bit tricky to understand at first, but very powerful.
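A quick demonstration of that example (the shape and indices are the made-up values from the sentence above):

import numpy as np

# multi-index (1, 4) in an array of shape (2, 9) flattens to 1*9 + 4 = 13
np.ravel_multi_index((1, 4), (2, 9))   # -> 13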

Upvotes: 0
