0vbb

Reputation: 903

Numpy: shuffle arrays in unison multiple times with different seeds

I have multiple numpy arrays with the same number of rows (axis_0) that I'd like to shuffle in unison. After one shuffle, I'd like to shuffle them again with a different random seed.


Until now, I've used the solution from "Better way to shuffle two numpy arrays in unison":

def shuffle_in_unison(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

However, this doesn't work for multiple unison shuffles, since rng_state is always the same.


I've tried to use RandomState in order to get a different seed for each call, but this doesn't even work for a single unison shuffle:

a = np.array([1,2,3,4,5])
b = np.array([10,20,30,40,50])

def shuffle_in_unison(a, b):
    r = np.random.RandomState() # different state from /dev/urandom for each call
    state = r.get_state()
    np.random.shuffle(a) # array([4, 2, 1, 5, 3])
    np.random.set_state(state)
    np.random.shuffle(b) # array([40, 20, 50, 10, 30])
    # -> doesn't work
    return a,b

for i in range(10):
    a, b = shuffle_in_unison(a, b)
    print(a, b)

What am I doing wrong?



Edit:

For anyone who doesn't have huge arrays like mine, just use the solution by Francesco (https://stackoverflow.com/a/47156309/3955022):

def shuffle_in_unison(a, b):
    n_elem = a.shape[0]
    indices = np.random.permutation(n_elem)
    return a[indices], b[indices]

The only drawback is that this is not an in-place operation, which is a pity for large arrays like mine (500G).

Upvotes: 2

Views: 3324

Answers (3)

Isaac B

Reputation: 755

I don't normally have to shuffle my data more than once at a time. But this function accommodates any number of input arrays, as well as any number of random shuffles - and it shuffles in-place.

import numpy as np


def shuffle_arrays(arrays, shuffle_quant=1):
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    max_int = 2**(32 - 1) - 1

    for i in range(shuffle_quant):
        seed = np.random.randint(0, max_int)
        for arr in arrays:
            rstate = np.random.RandomState(seed)
            rstate.shuffle(arr)

It can be used like this:

a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])

shuffle_arrays([a, b, c], shuffle_quant=5)

A few things to note:

  • The method uses only NumPy and no other packages.
  • The assert ensures that all input arrays have the same length along their first dimension.
  • max_int keeps the random seed within the int32 range.
  • Arrays are shuffled in place along their first dimension; nothing is returned.

After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
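A quick way to check that the arrays stay aligned after repeated shuffles is sketched below. It assumes, purely for illustration, that b is built as 10 * a so the pairing is easy to verify; the names aligned and same_elements are not part of the function above.

```python
import numpy as np


def shuffle_arrays(arrays, shuffle_quant=1):
    # Same function as above: every array is shuffled in place with a
    # RandomState built from the same seed, so all get the same permutation.
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    max_int = 2**(32 - 1) - 1

    for i in range(shuffle_quant):
        seed = np.random.randint(0, max_int)
        for arr in arrays:
            rstate = np.random.RandomState(seed)
            rstate.shuffle(arr)


# Illustrative data (an assumption): b[i] == 10 * a[i] before shuffling.
a = np.arange(1, 6)
b = a * 10

shuffle_arrays([a, b], shuffle_quant=5)

# The pairing must survive the shuffles, and no element may be lost.
aligned = bool(np.array_equal(b, a * 10))
same_elements = sorted(a.tolist()) == [1, 2, 3, 4, 5]
print(a, b, aligned, same_elements)
```

If aligned ever came out False, the arrays would have been permuted differently, which would defeat the purpose of the shared seed.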

Upvotes: 1

Francesco Montesano

Reputation: 8658

I don't know what you are doing wrong with the way you set the state. However, I found an alternative solution: instead of shuffling n arrays, shuffle their indices only once with numpy.random.choice and then reorder all the arrays.

a = np.array([1,2,3,4,5])
b = np.array([10,20,30,40,50])

def shuffle_in_unison(a, b):
    n_elem = a.shape[0]
    # draw every index exactly once, in random order
    indices = np.random.choice(n_elem, size=n_elem, replace=False)
    return a[indices], b[indices]

for i in range(5):
    a, b = shuffle_in_unison(a, b)
    print(a, b)

I get:

[5 2 4 3 1] [50 20 40 30 10]
[1 3 4 2 5] [10 30 40 20 50]
[1 2 5 4 3] [10 20 50 40 30]
[3 2 1 4 5] [30 20 10 40 50]
[1 2 5 3 4] [10 20 50 30 40]

Edit:

Thanks to @Divakar for the suggestion. Here is a more readable way to obtain the same result using numpy.random.permutation:

def shuffle_in_unison(a, b):
    n_elem = a.shape[0]
    indices = np.random.permutation(n_elem)
    return a[indices], b[indices]
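As for the RandomState attempt in the question: that snippet saves the state of the new RandomState object r, but then shuffles a with the global np.random generator before the saved state is applied, so a and b are shuffled from different states. A minimal sketch of one possible fix, assuming it is acceptable to do both shuffles through r itself (the function name shuffle_in_unison_inplace and the seed parameter are hypothetical, added only for illustration):

```python
import numpy as np


def shuffle_in_unison_inplace(a, b, seed=None):
    # seed=None lets NumPy seed the generator from OS entropy,
    # so every call starts from a different state.
    r = np.random.RandomState(seed)
    state = r.get_state()   # remember the state before the first shuffle
    r.shuffle(a)            # shuffle a with r, not with the global generator
    r.set_state(state)      # rewind r to the remembered state
    r.shuffle(b)            # b now receives the same permutation as a


a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

for _ in range(3):
    shuffle_in_unison_inplace(a, b)
    print(a, b)
```

Both arrays are shuffled in place, so no copies are made, which may matter for very large arrays.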

Upvotes: 4

cardamom

Reputation: 7421

I don't know exactly what you are doing wrong, but you have not chosen the solution with the most votes on that page, nor the one with the second most votes. Try this one:

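import numpy as np
from sklearn.utils import shuffle

X = np.array([1, 2, 3, 4, 5])       # inputs assumed from the output below
Y = np.array([10, 20, 30, 40, 50])
for i in range(10):
    X, Y = shuffle(X, Y, random_state=i)
    print("X - ", X, "Y - ", Y)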

Output:

X -  [3 5 1 4 2] Y -  [30 50 10 40 20]
X -  [1 5 2 3 4] Y -  [10 50 20 30 40]
X -  [2 4 5 3 1] Y -  [20 40 50 30 10]
X -  [3 1 4 2 5] Y -  [30 10 40 20 50]
X -  [3 2 1 5 4] Y -  [30 20 10 50 40]
X -  [4 3 2 1 5] Y -  [40 30 20 10 50]
X -  [1 5 4 3 2] Y -  [10 50 40 30 20]
X -  [1 3 4 5 2] Y -  [10 30 40 50 20]
X -  [2 4 3 1 5] Y -  [20 40 30 10 50]
X -  [1 2 4 3 5] Y -  [10 20 40 30 50]

Upvotes: 2
