Reputation: 41
I want to shuffle my data in such manner that each 4 rows remain intact. For example I have 16 rows then first 4 rows can go to last and then second four rows may go to third and so on in any particular order. I am trying to do thins in python
Upvotes: 1
Views: 662
Reputation: 56
Below code in python does the magic
from random import shuffle
import numpy as np
from math import ceil
#creating sample dataset
d=[[i*4 +j for i in range(5)] for j in range(25)]
a = np.array(d, int)
print '--------------Input--------------'
print a
gl=4 #group length i.e number of rows needs to be intact
parts=ceil(1.0*len(a)/gl) #no of partitions based on grouplength for the given dataset
#creating partition list and shuffling it to use later
x = [i for i in range(int(parts))]
shuffle(x)
#Creates new dataset based on shuffled partition list
fg=x.pop(0)
f = a[gl*fg:gl*(fg+1)]
for i in x:
t=a[gl*i:(i+1)*gl]
f=np.concatenate((f, t), axis=0)
print '--------------Output--------------'
print f
Upvotes: 0
Reputation: 221614
Reshape splitting the first axis into two with the later of length same as the group length = 4
, giving us a 3D
array and then use np.random.shuffle
, which shuffles along the first axis. The reshaped version being a view into the original array, assigns back the results directly into it. Being in-situ
, this should be pretty efficient (both memory-wise and on performance).
Hence, the implementation would be as simple as this -
def array_shuffle(a, n=4):
a3D = a.reshape(a.shape[0]//n,n,-1) # a is input array
np.random.shuffle(a3D)
Another variant of it would be to generate random permutations covering the length of the 3D
array, then indexing into it with those and finally reshaping back to 2D
.This makes a copy, but seems more performant than in-situ
edits as shown in the previous method.
The implementation would be -
def array_permuted_indexing(a, n=4):
m = a.shape[0]//n
a3D = a.reshape(m, n, -1)
return a3D[np.random.permutation(m)].reshape(-1,a3D.shape[-1])
Step-by-step run on shuffling method -
1] Setup random input array and split into a 3D version :
In [2]: np.random.seed(0)
In [3]: a = np.random.randint(11,99,(16,3))
In [4]: a3D = a.reshape(a.shape[0]//4,4,-1)
In [5]: a
Out[5]:
array([[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23],
[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31],
[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50],
[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39]])
2] Check the 3D array :
In [6]: a3D
Out[6]:
array([[[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23]],
[[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31]],
[[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50]],
[[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39]]])
3] Shuffle it along the first axis (in-situ) :
In [7]: np.random.shuffle(a3D)
In [8]: a3D
Out[8]:
array([[[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31]],
[[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39]],
[[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23]],
[[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50]]])
4] Verify the changes back in the original array :
In [9]: a
Out[9]:
array([[69, 76, 50],
[98, 57, 92],
[48, 36, 88],
[83, 20, 31],
[43, 76, 20],
[68, 43, 42],
[85, 34, 46],
[86, 66, 39],
[55, 58, 75],
[78, 78, 20],
[94, 32, 47],
[98, 81, 23],
[91, 80, 90],
[58, 75, 93],
[60, 40, 30],
[30, 25, 50]])
Runtime test
In [102]: a = np.random.randint(11,99,(16000,3))
In [103]: df = pd.DataFrame(a)
# @piRSquared's soln1
In [106]: %timeit df.iloc[np.random.permutation(np.arange(df.shape[0]).reshape(-1, 4)).ravel()]
100 loops, best of 3: 2.88 ms per loop
# @piRSquared's soln2
In [107]: %%timeit
...: d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
...: pd.concat([d.xs(i) for i in np.random.permutation(range(4))])
100 loops, best of 3: 3.48 ms per loop
# Array based soln-1
In [108]: %timeit array_shuffle(a, n=4)
100 loops, best of 3: 3.38 ms per loop
# Array based soln-2
In [109]: %timeit array_permuted_indexing(a, n=4)
10000 loops, best of 3: 125 µs per loop
Upvotes: 3
Reputation: 294458
Setup
Consider the dataframe df
df = pd.DataFrame(np.random.randint(10, size=(16, 4)), columns=list('WXYZ'))
df
W X Y Z
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
Option 1
Inspired by @B.M. and @Divakar
I'm using np.random.permutation
because it returns a copy that is a permuted version of what was passed. This means I can then pass that directly to iloc
and return what I need.
df.iloc[np.random.permutation(np.arange(16).reshape(-1, 4)).ravel()]
W X Y Z
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
Option 2
I'd add a level to the index that we can call on when shuffling
d = df.set_index(np.arange(len(df)) // 4, append=True).swaplevel(0, 1)
d
W X Y Z
0 0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
1 4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
2 8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
3 12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
Then we can shuffle
pd.concat([d.xs(i) for i in np.random.permutation(range(4))])
W X Y Z
12 9 2 2 2
13 8 8 2 5
14 4 1 5 6
15 1 2 3 9
4 7 7 2 2
5 5 5 0 2
6 9 3 2 7
7 5 7 2 9
0 9 8 6 2
1 0 9 5 5
2 7 5 9 4
3 7 1 1 8
8 6 6 2 8
9 0 7 0 8
10 7 5 5 2
11 6 0 9 5
Upvotes: 2