Reputation: 47
I have a very big (length ~ 150 millions) numpy array that has very few non zero values (about 99.9% of the array is 0). I want to shuffle it, but the shuffle is slow (it takes about 10 seconds, which is not acceptable because I am doing Monte Carlo simulations). Is there a way to shuffle it in a way that takes into account the fact that my array is mostly composed of 0?
I am thinking of shuffling just my positive values and then insert it randomly in an array full of 0
's, but I cannot find a numpy function for that.
Upvotes: 4
Views: 160
Reputation: 14399
Similar to @Divakar's method, but using scipy.sparse
:
a = scipy.sparse.coo_matrix(a)
def shuffle_sparse_coo(a):
a.col = np.random.choice(a.shape[1], a.nnz, replace=0)
return a
shuffle_sparse_coo(a).todense() # Using Divakar's 'a' array
Out[408]: matrix([[0, 8, 0, 0, 7, 0, 0, 0, 0, 4, 0, 0, 5, 0, 3, 0, 1, 0, 0, 0]])
EDIT:
If you want to stay dense, I'm pretty sure this barely beats even @Divakar's hackish method:
%timeit shuffle_sparse_arr_hackish(a)
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 44.7 µs per loop
def shuffle_sparse_arr_nz(a):
out = np.zeros_like(a)
mask = np.nonzero(a)
idx = np.random.choice(a.size, mask[0].size, replace=0)
out[idx] = a[mask]
return out
%timeit shuffle_sparse_arr_nz(a)
The slowest run took 4.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 41 µs per loop
EDIT2:
Implementing @Divakar's hack into the sparse method:
def shuffle_sparse_coo_h(a):
idx = np.unique((a.shape[1]*np.random.rand(2*a.nnz)).astype(int))[:a.nnz]
while idx.size<n:
idx = np.unique((a.shape[1]*np.random.rand(2*a.nnz)).astype(int))[:a.nnz]
a.col = idx
return a
#using a 15m element a:
%timeit shuffle_sparse_arr_hackish(a)
10 loops, best of 3: 52.8 ms per loop
a1 = sparse.coo_matrix(a)
%timeit shuffle_sparse_coo(a1)
1 loop, best of 3: 1.01 s per loop
%timeit shuffle_sparse_coo_h(a1)
The slowest run took 4.58 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.02 ms per loop
EDIT2:
Further improvement using np.random.randint
def shuffle_sparse_coo_h2(a):
idx = np.unique(np.random.randint(0,a.shape[1],(2*a.nnz,)))[:a.nnz]
while idx.size < n:
idx = np.unique(np.random.randint(0,a.shape[1],(2*a.nnz,)))[:a.nnz]
a.col = idx
return a
%timeit shuffle_sparse_coo_h2(a1)
1000 loops, best of 3: 1.86 ms per loop
Upvotes: 2
Reputation: 221574
Approach #1 : Here's one approach -
def shuffle_sparse_arr(a):
out = np.zeros_like(a)
mask = a!=0
n = np.count_nonzero(mask)
idx = np.random.choice(a.size, n, replace=0)
out[idx] = a[mask]
return out
Approach #2 : Hackish way -
def shuffle_sparse_arr_hackish(a):
out = np.zeros_like(a)
mask = a!=0
n = np.count_nonzero(mask)
idx = np.unique((a.size*np.random.rand(2*n)).astype(int))[:n]
while idx.size<n:
idx = np.unique((a.size*np.random.rand(2*n)).astype(int))[:n]
np.random.shuffle(idx)
out[idx] = a[mask]
return out
Sample runs -
In [269]: # Setup input array
...: a = np.zeros((20),dtype=int)
...: sidx = np.random.choice(a.size, 6, replace=0)
...: a[sidx] = [5,8,4,1,7,3]
...:
In [270]: a
Out[270]: array([4, 0, 0, 8, 0, 0, 5, 0, 0, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 3])
In [271]: shuffle_sparse_arr(a)
Out[271]: array([0, 5, 0, 0, 0, 0, 1, 0, 4, 0, 0, 0, 0, 0, 0, 7, 3, 8, 0, 0])
In [272]: shuffle_sparse_arr_hackish(a)
Out[272]: array([3, 1, 5, 0, 4, 0, 7, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Runtime test -
In [288]: # Setup input array with 15 million and 99.9% zeros
...: a = np.zeros((15000000),dtype=int)
...:
...: # Set 100-99.9% as random non-zeros
...: n = int(a.size*((100-99.9)/100))
...:
...: set_idx = np.random.choice(a.size, n , replace=0)
...: nums = np.random.choice(a.size, n , replace=0)
...: a[set_idx] = nums
...:
In [289]: %timeit shuffle_sparse_arr(a)
1 loops, best of 3: 647 ms per loop
In [290]: %timeit shuffle_sparse_arr_hackish(a)
10 loops, best of 3: 29.1 ms per loop
In [291]: %timeit np.random.shuffle(a)
1 loops, best of 3: 606 ms per loop
Upvotes: 5