Reputation: 41

How to shuffle data in Python while keeping groups of n rows intact

I want to shuffle my data in such a manner that each group of 4 rows remains intact. For example, if I have 16 rows, the first 4 rows can go last, the second four rows may go third, and so on, in any order. I am trying to do this in Python.

Upvotes: 1

Answers (3)

Piyush Patel

Reputation: 56

The Python code below does this:

from random import shuffle
from math import ceil
import numpy as np

# Create a sample dataset: 25 rows, 5 columns
d = [[i*4 + j for i in range(5)] for j in range(25)]
a = np.array(d, int)
print('--------------Input--------------')
print(a)

gl = 4  # group length, i.e. the number of consecutive rows to keep intact
parts = ceil(len(a) / gl)  # number of partitions for the given dataset

# Create the list of partition indices and shuffle it for later use
x = list(range(parts))
shuffle(x)

# Build the new dataset from the shuffled partition list
fg = x.pop(0)
f = a[gl*fg:gl*(fg+1)]
for i in x:
    t = a[gl*i:(i+1)*gl]
    f = np.concatenate((f, t), axis=0)
print('--------------Output--------------')
print(f)

Upvotes: 0

Divakar

Reputation: 221614

Reshape the array, splitting the first axis into two, with the latter of length equal to the group length (4). This gives us a 3D array, on which we can use np.random.shuffle, which shuffles along the first axis. Since the reshaped version is a view into the original array, the shuffled result is written straight back into it. Being in-situ, this should be memory-efficient. Note that the number of rows has to be a multiple of the group length for the reshape to work.

Hence, the implementation would be as simple as this -

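import numpy as np

def array_shuffle(a, n=4):
    # a is the input 2D array; its rows are grouped into blocks of n
    a3D = a.reshape(a.shape[0]//n, n, -1)  # view into a (for a contiguous array)
    np.random.shuffle(a3D)  # shuffles the blocks in place, so a itself changes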
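
The in-place version relies on two things that are easy to miss: the row count must be a multiple of n, and the reshape must return a view rather than a copy. Here is a minimal sketch with both checks added. The name array_shuffle_checked and the error messages are my own, not part of the answer above.

```python
import numpy as np

def array_shuffle_checked(a, n=4):
    """Shuffle blocks of n consecutive rows of a 2D array in place (sketch)."""
    if a.shape[0] % n != 0:
        raise ValueError("number of rows must be a multiple of n")
    a3D = a.reshape(a.shape[0] // n, n, -1)
    # If reshape had to copy (e.g. for a non-contiguous slice), shuffling
    # a3D would silently leave `a` unchanged, so refuse to continue.
    if not np.may_share_memory(a, a3D):
        raise ValueError("reshape returned a copy; pass a contiguous array")
    np.random.shuffle(a3D)  # shuffles along the first axis, in place

a = np.arange(16 * 3).reshape(16, 3)
array_shuffle_checked(a, n=4)
print(a)  # blocks of 4 rows appear in a random order, each block unchanged
```

The function returns None on purpose, mirroring np.random.shuffle: the caller's array is modified directly.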

Another variant would be to generate a random permutation covering the length of the 3D array, index into the array with it and finally reshape back to 2D. This makes a copy, but seems more performant than the in-situ edit shown in the previous method.

The implementation would be -

def array_permuted_indexing(a, n=4):
    m = a.shape[0]//n
    a3D = a.reshape(m, n, -1)
    return a3D[np.random.permutation(m)].reshape(-1,a3D.shape[-1])
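
If the row count is not a multiple of n, or you want a reproducible shuffle, the permuted-indexing idea can be extended roughly as below. This is only a sketch under my own assumptions: the name block_permute, the seed parameter, and the choice to leave the trailing incomplete block at the end are not from the answer above. It uses NumPy's Generator API (np.random.default_rng) instead of the global np.random functions.

```python
import numpy as np

def block_permute(a, n=4, seed=None):
    """Return a copy of 2D array `a` with its blocks of n rows permuted.

    Assumption: leftover rows (when len(a) is not a multiple of n)
    are kept in place at the end rather than shuffled.
    """
    rng = np.random.default_rng(seed)
    m = a.shape[0] // n                 # number of complete blocks
    cols = a.shape[1]
    head, tail = a[:m * n], a[m * n:]   # complete blocks, leftover rows
    blocks = head.reshape(m, n, cols)
    shuffled = blocks[rng.permutation(m)].reshape(m * n, cols)
    return np.concatenate([shuffled, tail], axis=0)

a = np.arange(18 * 3).reshape(18, 3)
print(block_permute(a, n=4, seed=0))
```

Passing the same seed gives the same block order on every call, which helps when the shuffled data feeds a test or an experiment that needs to be repeatable.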

Step-by-step run on shuffling method -

1] Set up a random input array and split it into a 3D version :

In [2]: np.random.seed(0)

In [3]: a = np.random.randint(11,99,(16,3))

In [4]: a3D = a.reshape(a.shape[0]//4,4,-1)

In [5]: a
Out[5]: 
array([[55, 58, 75],
       [78, 78, 20],
       [94, 32, 47],
       [98, 81, 23],
       [69, 76, 50],
       [98, 57, 92],
       [48, 36, 88],
       [83, 20, 31],
       [91, 80, 90],
       [58, 75, 93],
       [60, 40, 30],
       [30, 25, 50],
       [43, 76, 20],
       [68, 43, 42],
       [85, 34, 46],
       [86, 66, 39]])

2] Check the 3D array :

In [6]: a3D
Out[6]: 
array([[[55, 58, 75],
        [78, 78, 20],
        [94, 32, 47],
        [98, 81, 23]],

       [[69, 76, 50],
        [98, 57, 92],
        [48, 36, 88],
        [83, 20, 31]],

       [[91, 80, 90],
        [58, 75, 93],
        [60, 40, 30],
        [30, 25, 50]],

       [[43, 76, 20],
        [68, 43, 42],
        [85, 34, 46],
        [86, 66, 39]]])

3] Shuffle it along the first axis (in-situ) :

In [7]: np.random.shuffle(a3D)

In [8]: a3D
Out[8]: 
array([[[69, 76, 50],
        [98, 57, 92],
        [48, 36, 88],
        [83, 20, 31]],

       [[43, 76, 20],
        [68, 43, 42],
        [85, 34, 46],
        [86, 66, 39]],

       [[55, 58, 75],
        [78, 78, 20],
        [94, 32, 47],
        [98, 81, 23]],

       [[91, 80, 90],
        [58, 75, 93],
        [60, 40, 30],
        [30, 25, 50]]])

4] Verify the changes back in the original array :

In [9]: a
Out[9]: 
array([[69, 76, 50],
       [98, 57, 92],
       [48, 36, 88],
       [83, 20, 31],
       [43, 76, 20],
       [68, 43, 42],
       [85, 34, 46],
       [86, 66, 39],
       [55, 58, 75],
       [78, 78, 20],
       [94, 32, 47],
       [98, 81, 23],
       [91, 80, 90],
       [58, 75, 93],
       [60, 40, 30],
       [30, 25, 50]])
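
To check a result programmatically rather than by eye, something like the following could be used. It is a sketch, and is_block_shuffle is a hypothetical helper name. It compares the multiset of n-row blocks before and after, so it confirms that blocks stayed intact without caring about their order.

```python
import numpy as np

def is_block_shuffle(before, after, n=4):
    """True if `after` holds the same n-row blocks as `before`, in any order."""
    if before.shape != after.shape or before.shape[0] % n != 0:
        return False
    m = before.shape[0] // n
    # Flatten each block into one tuple, then compare the sorted collections.
    blocks_before = sorted(map(tuple, before.reshape(m, -1).tolist()))
    blocks_after = sorted(map(tuple, after.reshape(m, -1).tolist()))
    return blocks_before == blocks_after

a = np.arange(48).reshape(16, 3)
b = a.reshape(4, 4, 3)[[2, 0, 3, 1]].reshape(16, 3)  # blocks reordered
print(is_block_shuffle(a, b))        # True
print(is_block_shuffle(a, a[::-1]))  # False: rows inside each block are reversed
```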

Runtime test

In [102]: a = np.random.randint(11,99,(16000,3))

In [103]: df = pd.DataFrame(a)

# @piRSquared's soln1
In [106]: %timeit df.iloc[np.random.permutation(np.arange(df.shape[0]).reshape(-1, 4)).ravel()]
100 loops, best of 3: 2.88 ms per loop

# @piRSquared's soln2
In [107]: %%timeit
     ...: d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
     ...: pd.concat([d.xs(i) for i in np.random.permutation(range(4))])
100 loops, best of 3: 3.48 ms per loop

# Array based soln-1
In [108]: %timeit array_shuffle(a, n=4)
100 loops, best of 3: 3.38 ms per loop

# Array based soln-2
In [109]: %timeit array_permuted_indexing(a, n=4)
10000 loops, best of 3: 125 µs per loop
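
Outside IPython, the two array-based timings can be reproduced with the standard timeit module. A rough sketch (the repeat count of 20 is an arbitrary choice of mine, and the numbers will differ by machine and NumPy version):

```python
import timeit
import numpy as np

def array_shuffle(a, n=4):
    a3D = a.reshape(a.shape[0] // n, n, -1)
    np.random.shuffle(a3D)  # in place

def array_permuted_indexing(a, n=4):
    m = a.shape[0] // n
    a3D = a.reshape(m, n, -1)
    return a3D[np.random.permutation(m)].reshape(-1, a3D.shape[-1])

a = np.random.randint(11, 99, (16000, 3))
runs = 20
t_inplace = timeit.timeit(lambda: array_shuffle(a, n=4), number=runs) / runs
t_indexed = timeit.timeit(lambda: array_permuted_indexing(a, n=4), number=runs) / runs
print(f"in-place shuffle : {t_inplace * 1e6:.1f} µs per call")
print(f"permuted indexing: {t_indexed * 1e6:.1f} µs per call")
```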

Upvotes: 3

piRSquared

Reputation: 294458

Setup

Consider the dataframe df

df = pd.DataFrame(np.random.randint(10, size=(16, 4)), columns=list('WXYZ'))
df

    W  X  Y  Z
0   9  8  6  2
1   0  9  5  5
2   7  5  9  4
3   7  1  1  8
4   7  7  2  2
5   5  5  0  2
6   9  3  2  7
7   5  7  2  9
8   6  6  2  8
9   0  7  0  8
10  7  5  5  2
11  6  0  9  5
12  9  2  2  2
13  8  8  2  5
14  4  1  5  6
15  1  2  3  9

Option 1
Inspired by @B.M. and @Divakar
I'm using np.random.permutation because it returns a copy that is a permuted version of what was passed. This means I can then pass that directly to iloc and return what I need.

df.iloc[np.random.permutation(np.arange(16).reshape(-1, 4)).ravel()]

    W  X  Y  Z
12  9  2  2  2
13  8  8  2  5
14  4  1  5  6
15  1  2  3  9
0   9  8  6  2
1   0  9  5  5
2   7  5  9  4
3   7  1  1  8
8   6  6  2  8
9   0  7  0  8
10  7  5  5  2
11  6  0  9  5
4   7  7  2  2
5   5  5  0  2
6   9  3  2  7
7   5  7  2  9
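
For a frame of any length, the hard-coded 16 and 4 can be replaced by len(df) and a group-size parameter. A minimal sketch, assuming the row count is a multiple of n; shuffle_row_blocks is a name I made up for illustration.

```python
import numpy as np
import pandas as pd

def shuffle_row_blocks(df, n=4):
    """Return df with its blocks of n consecutive rows in random order."""
    if len(df) % n != 0:
        raise ValueError("len(df) must be a multiple of n")
    # One row of positions per block; permutation shuffles whole rows
    # of this 2D array and returns a copy.
    positions = np.arange(len(df)).reshape(-1, n)
    return df.iloc[np.random.permutation(positions).ravel()]

df = pd.DataFrame(np.random.randint(10, size=(16, 4)), columns=list('WXYZ'))
print(shuffle_row_blocks(df, n=4))
```

Because this goes through iloc, the original index labels travel with the rows, so you can still see where each block came from.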

Option 2

I'd add a level to the index that we can call on when shuffling

d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
d

      W  X  Y  Z
0 0   9  8  6  2
  1   0  9  5  5
  2   7  5  9  4
  3   7  1  1  8
1 4   7  7  2  2
  5   5  5  0  2
  6   9  3  2  7
  7   5  7  2  9
2 8   6  6  2  8
  9   0  7  0  8
  10  7  5  5  2
  11  6  0  9  5
3 12  9  2  2  2
  13  8  8  2  5
  14  4  1  5  6
  15  1  2  3  9

Then we can shuffle

pd.concat([d.xs(i) for i in np.random.permutation(range(4))])

    W  X  Y  Z
12  9  2  2  2
13  8  8  2  5
14  4  1  5  6
15  1  2  3  9
4   7  7  2  2
5   5  5  0  2
6   9  3  2  7
7   5  7  2  9
0   9  8  6  2
1   0  9  5  5
2   7  5  9  4
3   7  1  1  8
8   6  6  2  8
9   0  7  0  8
10  7  5  5  2
11  6  0  9  5
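
Note that the 4 in range(4) above is the number of groups (16 rows // 4), not the group size, so for other inputs it has to be derived from the data. A sketch of a generalized Option 2 follows; the function name shuffle_by_group_level is hypothetical. A trailing incomplete group is treated as its own group here and shuffled along with the others.

```python
import numpy as np
import pandas as pd

def shuffle_by_group_level(df, n=4):
    """Shuffle blocks of n rows via an extra index level (sketch)."""
    group_ids = np.arange(len(df)) // n
    # Add the group id as the outer index level.
    d = df.set_index(group_ids, append=True).swaplevel(0, 1)
    # Visit every group exactly once, in random order.
    order = np.random.permutation(np.unique(group_ids))
    return pd.concat([d.xs(i) for i in order])

df = pd.DataFrame(np.random.randint(10, size=(18, 4)), columns=list('WXYZ'))
print(shuffle_by_group_level(df, n=4))
```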

Upvotes: 2
