BillyBoy
BillyBoy

Reputation: 47

Python shuffle array that has very few non zeros (very sparsey)

I have a very big (length ~ 150 millions) numpy array that has very few non zero values (about 99.9% of the array is 0). I want to shuffle it, but the shuffle is slow (it takes about 10 seconds, which is not acceptable because I am doing Monte Carlo simulations). Is there a way to shuffle it in a way that takes into account the fact that my array is mostly composed of 0?

I am thinking of shuffling just my positive values and then insert it randomly in an array full of 0's, but I cannot find a numpy function for that.

Upvotes: 4

Views: 160

Answers (2)

Daniel F
Daniel F

Reputation: 14399

Similar to @Divakar's method, but using scipy.sparse:

a = scipy.sparse.coo_matrix(a)

def shuffle_sparse_coo(a):
    a.col = np.random.choice(a.shape[1], a.nnz, replace=0)
    return a

shuffle_sparse_coo(a).todense() # Using Divakar's 'a' array
Out[408]: matrix([[0, 8, 0, 0, 7, 0, 0, 0, 0, 4, 0, 0, 5, 0, 3, 0, 1, 0, 0, 0]])

EDIT:

If you want to stay dense, I'm pretty sure this barely beats even @Divakar's hackish method:

%timeit shuffle_sparse_arr_hackish(a)
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 44.7 µs per loop

def shuffle_sparse_arr_nz(a):
    out = np.zeros_like(a)
    mask = np.nonzero(a)
    idx = np.random.choice(a.size, mask[0].size, replace=0)
    out[idx] = a[mask]
    return out

%timeit shuffle_sparse_arr_nz(a)
The slowest run took 4.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 41 µs per loop

EDIT2:

Implementing @Divakar's hack into the sparse method:

def shuffle_sparse_coo_h(a):
    idx = np.unique((a.shape[1]*np.random.rand(2*a.nnz)).astype(int))[:a.nnz]
    while idx.size<n:
        idx = np.unique((a.shape[1]*np.random.rand(2*a.nnz)).astype(int))[:a.nnz]
    a.col = idx
    return a

#using a 15m element a:

%timeit shuffle_sparse_arr_hackish(a)
10 loops, best of 3: 52.8 ms per loop

a1 = sparse.coo_matrix(a)
%timeit shuffle_sparse_coo(a1)
1 loop, best of 3: 1.01 s per loop

%timeit shuffle_sparse_coo_h(a1)
The slowest run took 4.58 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.02 ms per loop

EDIT2:

Further improvement using np.random.randint

def shuffle_sparse_coo_h2(a):
    idx = np.unique(np.random.randint(0,a.shape[1],(2*a.nnz,)))[:a.nnz]
    while idx.size < n:
        idx = np.unique(np.random.randint(0,a.shape[1],(2*a.nnz,)))[:a.nnz]
    a.col = idx
    return a  

%timeit shuffle_sparse_coo_h2(a1)
1000 loops, best of 3: 1.86 ms per loop

Upvotes: 2

Divakar
Divakar

Reputation: 221574

Approach #1 : Here's one approach -

def shuffle_sparse_arr(a):
    out = np.zeros_like(a)
    mask = a!=0
    n = np.count_nonzero(mask)
    idx = np.random.choice(a.size, n, replace=0)
    out[idx] = a[mask]
    return out

Approach #2 : Hackish way -

def shuffle_sparse_arr_hackish(a):
    out = np.zeros_like(a)
    mask = a!=0
    n = np.count_nonzero(mask)
    idx = np.unique((a.size*np.random.rand(2*n)).astype(int))[:n]
    while idx.size<n:
        idx = np.unique((a.size*np.random.rand(2*n)).astype(int))[:n]
    np.random.shuffle(idx)
    out[idx] = a[mask]
    return out

Sample runs -

In [269]: # Setup input array
     ...: a = np.zeros((20),dtype=int)
     ...: sidx = np.random.choice(a.size, 6, replace=0)
     ...: a[sidx] = [5,8,4,1,7,3]
     ...: 

In [270]: a
Out[270]: array([4, 0, 0, 8, 0, 0, 5, 0, 0, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 3])

In [271]: shuffle_sparse_arr(a)
Out[271]: array([0, 5, 0, 0, 0, 0, 1, 0, 4, 0, 0, 0, 0, 0, 0, 7, 3, 8, 0, 0])

In [272]: shuffle_sparse_arr_hackish(a)
Out[272]: array([3, 1, 5, 0, 4, 0, 7, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Runtime test -

In [288]: # Setup input array with 15 million and 99.9% zeros
     ...: a = np.zeros((15000000),dtype=int)
     ...: 
     ...: # Set 100-99.9% as random non-zeros
     ...: n = int(a.size*((100-99.9)/100)) 
     ...: 
     ...: set_idx = np.random.choice(a.size, n , replace=0)
     ...: nums = np.random.choice(a.size, n , replace=0)
     ...: a[set_idx] = nums
     ...: 

In [289]: %timeit shuffle_sparse_arr(a)
1 loops, best of 3: 647 ms per loop

In [290]: %timeit shuffle_sparse_arr_hackish(a)
10 loops, best of 3: 29.1 ms per loop

In [291]: %timeit np.random.shuffle(a)
1 loops, best of 3: 606 ms per loop

Upvotes: 5

Related Questions