Python shuffle array that has very few non zeros (very sparsey)

Question

I have a very large NumPy array (length ~150 million) with very few non-zero values (about 99.9% of the array is 0). I want to shuffle it, but the shuffle is slow: it takes about 10 seconds, which is not acceptable because I am running Monte Carlo simulations. Is there a way to shuffle it that takes advantage of the fact that the array is mostly zeros?

I am thinking of shuffling just the positive values and then inserting them at random positions in an array full of zeros, but I cannot find a NumPy function for that.

Daniel F · Accepted Answer

Similar to @Divakar's method, but using scipy.sparse:

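import numpy as np
import scipy.sparse

a = scipy.sparse.coo_matrix(a)  # COO format stores only the non-zero entries

def shuffle_sparse_coo(a):
    # Give each non-zero value a distinct random column (modifies 'a' in place)
    a.col = np.random.choice(a.shape[1], a.nnz, replace=False)
    return a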

shuffle_sparse_coo(a).todense() # Using Divakar's 'a' array
Out[408]: matrix([[0, 8, 0, 0, 7, 0, 0, 0, 0, 4, 0, 0, 5, 0, 3, 0, 1, 0, 0, 0]])
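
For anyone who wants to try the sparse approach end to end, here is a minimal self-contained sketch. It assumes a 1-D input with few non-zeros. The function name `shuffle_coo_row`, the example values, and the use of `np.random.default_rng` instead of the legacy `np.random.choice` are my own choices for illustration, not part of the answer above, and it has not been benchmarked.

```python
import numpy as np
import scipy.sparse


def shuffle_coo_row(a, rng):
    """Return a new 1 x N COO matrix with a's non-zeros at random distinct columns."""
    n_cols = a.shape[1]
    # Draw a.nnz distinct column indices. The draw order is random, so the
    # values are also permuted relative to each other.
    new_cols = rng.choice(n_cols, size=a.nnz, replace=False)
    new_rows = np.zeros(a.nnz, dtype=int)
    return scipy.sparse.coo_matrix((a.data, (new_rows, new_cols)), shape=a.shape)


rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
dense = np.zeros(20, dtype=int)
dense[[1, 4, 9, 12, 14, 16]] = [8, 7, 4, 5, 3, 1]  # hypothetical example data

sparse_in = scipy.sparse.coo_matrix(dense)  # a 1-D input becomes a 1 x 20 matrix
shuffled = shuffle_coo_row(sparse_in, rng).toarray().ravel()
```

Unlike `shuffle_sparse_coo`, this version returns a new matrix and leaves the input untouched, which may be convenient when the same array is reshuffled many times in a Monte Carlo loop.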

EDIT:

If you want to stay dense, I'm pretty sure this barely beats even @Divakar's hackish method:

%timeit shuffle_sparse_arr_hackish(a)
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 44.7 µs per loop

def shuffle_sparse_arr_nz(a):
    out = np.zeros_like(a)
    mask = np.nonzero(a)
    idx = np.random.choice(a.size, mask[0].size, replace=False)
    out[idx] = a[mask]
    return out

%timeit shuffle_sparse_arr_nz(a)
The slowest run took 4.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 41 µs per loop

EDIT2:

Implementing @Divakar's hack into the sparse method:

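def shuffle_sparse_coo_h(a):
    idx = np.unique((a.shape[1]*np.random.rand(2*a.nnz)).astype(int))[:a.nnz]
    while idx.size < a.nnz:  # retry if too few unique indices were drawn
        idx = np.unique((a.shape[1]*np.random.rand(2*a.nnz)).astype(int))[:a.nnz]
    a.col = idx
    return a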

EDIT3:

Further improvement using np.random.randint:

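def shuffle_sparse_coo_h2(a):
    # Oversample with replacement, deduplicate, keep a.nnz indices (np.unique sorts them)
    idx = np.unique(np.random.randint(0,a.shape[1],(2*a.nnz,)))[:a.nnz]
    while idx.size < a.nnz:  # retry if too few unique indices were drawn
        idx = np.unique(np.random.randint(0,a.shape[1],(2*a.nnz,)))[:a.nnz]
    a.col = idx
    return a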

%timeit shuffle_sparse_coo_h2(a1)
1000 loops, best of 3: 1.86 ms per loop
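
One caveat with the np.unique-based hack: np.unique returns its result sorted, so slicing off the first a.nnz entries keeps the smallest indices drawn and leaves the non-zero values in their original relative order. If that matters for the simulation, one possible variation is sketched below for the dense case. It keeps the oversample-and-deduplicate idea but picks the final indices at random. The names `random_distinct_indices` and `shuffle_sparse_dense` are hypothetical, the sketch assumes the number of non-zeros is much smaller than the array length, and it has not been timed against the versions above.

```python
import numpy as np


def random_distinct_indices(n, k, rng):
    """Draw k distinct indices from range(n), in random order, by oversampling.

    Assumes k is much smaller than n; otherwise the retry loop may run long.
    """
    if k > n:
        raise ValueError("cannot draw more distinct indices than positions")
    uniq = np.unique(rng.integers(0, n, size=2 * k))
    while uniq.size < k:  # rare when k is much smaller than n
        uniq = np.unique(rng.integers(0, n, size=2 * k))
    # np.unique returns sorted values, so permute before truncating instead of
    # slicing off the first k; this also randomizes the order of the indices.
    return rng.permutation(uniq)[:k]


def shuffle_sparse_dense(a, rng):
    """Return a shuffled copy of a mostly-zero 1-D array."""
    out = np.zeros_like(a)
    values = a[a != 0]
    out[random_distinct_indices(a.size, values.size, rng)] = values
    return out


rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
a = np.zeros(1000, dtype=int)
a[:5] = [1, 2, 3, 4, 5]  # hypothetical example data
shuffled = shuffle_sparse_dense(a, rng)
```

The extra permutation only touches the deduplicated candidates (at most twice the number of non-zeros), so its cost should stay small relative to the array length.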
