PythonNut

Reputation: 6379

Fast column shuffle of each row numpy

I have a large 2-D array with 10,000,000+ rows, and I need to shuffle each row independently. For example:

[[1,2,3]
 [1,2,3]
 [1,2,3]
 ...
 [1,2,3]]

to

[[3,1,2]
 [2,1,3]
 [1,3,2]
 ...
 [1,2,3]]

I'm currently using

map(numpy.random.shuffle, array)

But this is a Python-level (not NumPy) loop, and it takes 99% of my execution time. Sadly, the PyPy JIT doesn't implement numpypy.random, so I'm out of luck there. Is there a faster way? I'm willing to use any library (pandas, scikit-learn, SciPy, Theano, etc.) as long as it works with a NumPy ndarray or a derivative.

If not, I suppose I'll resort to Cython or C++.

Upvotes: 15

Views: 4377

Answers (4)

rocking_ellipse

Reputation: 295

I believe I have an alternate, equivalent strategy, building upon the previous answers:

import itertools
import numpy as np

# original sequence
a0 = np.arange(3) + 1

# length of original sequence
L = a0.shape[0]

# number of random samples/shuffles (must be an int; the float 1e4 is rejected as a size)
N_samp = 10**4

# from above: enumerate every index permutation, then pick one at random per sample
all_perm = np.array(list(itertools.permutations(np.arange(L))))
b = all_perm[np.random.randint(0, len(all_perm), size=N_samp)]

# index a0 with each row of b; the result has shape (N_samp, L)
a_samp = a0[b]

I'm not sure how this compares performance-wise, but I like it for its readability.

Upvotes: 0

unutbu

Reputation: 879271

If the number of columns is small enough that all column permutations can be enumerated, then you could do this:

import itertools as IT
import numpy as np

def using_perms(array):
    nrows, ncols = array.shape
    perms = np.array(list(IT.permutations(range(ncols))))
    choices = np.random.randint(len(perms), size=nrows)
    i = np.arange(nrows).reshape(-1, 1)
    return array[i, perms[choices]]

N = 10**7
array = np.tile(np.arange(1,4), (N,1))
print(using_perms(array))

yields (something like)

[[3 2 1]
 [3 1 2]
 [2 3 1]
 [1 2 3]
 [3 1 2]
 ...
 [1 3 2]
 [3 1 2]
 [3 2 1]
 [2 1 3]
 [1 3 2]]
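To see why `array[i, perms[choices]]` permutes each row separately, here is a minimal sketch on a 2x3 array. The names `small`, `rows`, and `cols` are made up for illustration, and `choices` is fixed by hand rather than drawn from `np.random.randint` so that the result is reproducible.

```python
import itertools as IT
import numpy as np

small = np.array([[10, 20, 30],
                  [40, 50, 60]])

# All 3! = 6 column orderings, in the order itertools generates them.
perms = np.array(list(IT.permutations(range(3))))   # shape (6, 3)

# Fixed choices (instead of random ones) so the output is deterministic.
choices = np.array([5, 1])
cols = perms[choices]                 # shape (2, 3): one permutation per row
rows = np.arange(2).reshape(-1, 1)    # shape (2, 1): broadcasts against cols

# Element [r, c] of the result is small[rows[r, 0], cols[r, c]].
shuffled = small[rows, cols]
print(shuffled)
# [[30 20 10]
#  [40 60 50]]
```

The (2, 1) row index broadcasts against the (2, 3) column index, so every output element is read from its own row, with only the column position changed.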

Here is a benchmark comparing it to

def using_shuffle(array):
    # explicit loop: map() is lazy in Python 3 and would shuffle nothing
    for row in array:
        np.random.shuffle(row)
    return array

In [151]: %timeit using_shuffle(array)
1 loops, best of 3: 7.17 s per loop

In [152]: %timeit using_perms(array)
1 loops, best of 3: 2.78 s per loop

Edit: CT Zhu's method is faster than mine:

def using_Zhu(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    # offset each row's column indices by row*ncols to index the flattened array
    return (array.flatten()[(b + ncols*np.arange(nrows)[..., np.newaxis]).flatten()]
            ).reshape(array.shape)

In [177]: %timeit using_Zhu(array)
1 loops, best of 3: 1.7 s per loop

Here is a slight variation of Zhu's method which may be even a bit faster:

def using_Zhu2(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    # take() gathers from the flattened array without the intermediate flatten() copy
    return array.take((b + ncols*np.arange(nrows)[..., np.newaxis]).ravel()).reshape(array.shape)

In [201]: %timeit using_Zhu2(array)
1 loops, best of 3: 1.46 s per loop

Upvotes: 8

CT Zhu

Reputation: 54330

Here are some ideas:

In [10]: a=np.zeros(shape=(1000,3))

In [12]: a[:,0]=1

In [13]: a[:,1]=2

In [14]: a[:,2]=3

In [17]: %timeit map(np.random.shuffle, a)
100 loops, best of 3: 4.65 ms per loop

In [21]: all_perm=np.array((list(itertools.permutations([0,1,2]))))

In [22]: b=all_perm[np.random.randint(0,6,size=1000)]

In [25]: %timeit (a.flatten()[(b+3*np.arange(1000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1000 loops, best of 3: 393 us per loop

If there are only a few columns, the number of possible permutations is much smaller than the number of rows in the array (with 3 columns there are only 3! = 6 permutations). One way to make this faster is to generate all the permutations once up front and then rearrange each row by randomly picking one of them.
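The flat-index expression used in the timings can be hard to read, so here is a small sketch of what it computes. The permutation array `b` is hard-coded for illustration. The comparison with `np.take_along_axis` is an addition, not part of the original answer, and assumes NumPy 1.15 or later.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
nrows, ncols = a.shape

# One column permutation per row (hard-coded here instead of random).
b = np.array([[2, 0, 1],
              [1, 2, 0]])

# Row r starts at position r * ncols in the flattened array,
# so adding that offset turns column indices into flat indices.
flat_idx = b + ncols * np.arange(nrows)[:, np.newaxis]
via_flat = a.ravel()[flat_idx.ravel()].reshape(a.shape)

# np.take_along_axis expresses the same per-row gather directly.
via_take = np.take_along_axis(a, b, axis=1)

print(flat_idx)   # [[2 0 1]
                  #  [4 5 3]]
print(via_flat)   # [[3 1 2]
                  #  [5 6 4]]
```

Both routes give the same array. The flat-index form just spells out the offset arithmetic by hand.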

It still appears to be more than 10 times faster with 1,000,000 rows:

#adjust a accordingly
In [32]: b=all_perm[np.random.randint(0,6,size=1000000)]

In [33]: %timeit (a.flatten()[(b+3*np.arange(1000000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1 loops, best of 3: 348 ms per loop

In [34]: %timeit map(np.random.shuffle, a)
1 loops, best of 3: 4.64 s per loop
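The permutation table grows factorially with the number of columns, so it stops being practical beyond a handful of columns. For that case, one possible alternative (a different technique from the one timed above, and not benchmarked here) is to draw a random key for every element and argsort the keys along each row. The function name `shuffle_rows_argsort` is hypothetical, and the sketch assumes a NumPy version that provides `np.random.default_rng` and `np.take_along_axis`.

```python
import numpy as np

def shuffle_rows_argsort(array, rng=None):
    """Return a copy of `array` with each row independently permuted."""
    rng = np.random.default_rng() if rng is None else rng
    # One uniform random key per element; sorting the keys within a row
    # yields a random ordering of that row's column indices.
    keys = rng.random(array.shape)
    order = np.argsort(keys, axis=1)
    return np.take_along_axis(array, order, axis=1)

# 8 columns would already need 8! = 40320 table entries with the table approach.
demo = np.tile(np.arange(1, 9), (5, 1))
result = shuffle_rows_argsort(demo, rng=np.random.default_rng(0))
print(result)
```

This does no Python-level work per row, but it costs a sort per row, so whether it beats the table approach for small column counts would need measuring. Recent NumPy versions also offer `Generator.permuted(x, axis=1)`, which shuffles each row independently without the explicit argsort.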

Upvotes: 8

waitingkuo

Reputation: 93754

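You can also try the apply function in pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame(array)
# np.random.shuffle works in place and returns None, so "or x" hands back the shuffled row
df = df.apply(lambda x: np.random.shuffle(x) or x, axis=1)

And then extract the NumPy array from the DataFrame:

print(df.values)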

Upvotes: 0
