Reputation: 425

Shuffle a DataFrame while keeping internal order

I have a DataFrame containing pre-processed data in which every 4 consecutive rows form a sequence (later to be reshaped and used for LSTM training).

I want to shuffle the DataFrame while keeping every sequence of rows intact. For example, a = [1,2,3,4,10,11,12,13,20,21,22,23] would turn into something like a = [20,21,22,23,1,2,3,4,10,11,12,13].

df.sample(frac=1) is not enough, since it shuffles individual rows and therefore breaks the sequences.
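To make the requirement concrete, here is a small check that tells a valid block shuffle from a broken one. The helper name `blocks_intact` is made up for illustration; it is only a sketch that assumes the values within the original blocks are distinct enough to be recognised.

```python
def blocks_intact(original, shuffled, seq_length=4):
    """Return True if every run of seq_length items in `shuffled`
    is exactly one of the blocks of `original`."""
    orig_blocks = {
        tuple(original[i:i + seq_length])
        for i in range(0, len(original), seq_length)
    }
    new_blocks = [
        tuple(shuffled[i:i + seq_length])
        for i in range(0, len(shuffled), seq_length)
    ]
    return all(block in orig_blocks for block in new_blocks)


a = [1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23]
good = [20, 21, 22, 23, 1, 2, 3, 4, 10, 11, 12, 13]  # blocks moved as a whole
bad = [20, 1, 22, 23, 21, 2, 3, 4, 10, 11, 12, 13]   # rows moved individually

print(blocks_intact(a, good))  # True
print(blocks_intact(a, bad))   # False
```

Applied to a DataFrame, the same check could be run on `df.index.tolist()` before and after shuffling; a row-wise shuffle such as `df.sample(frac=1)` will fail it in almost every run.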

Solution, thanks to @Wen-Ben:

import numpy as np
import pandas as pd
seq_length = 4
# Drop trailing rows that do not fill a complete sequence
n_rows = (df.shape[0] // seq_length) * seq_length
trunc_data = df.head(n_rows)
# Map each sequence number to its block of rows
d = {x: y for x, y in trunc_data.groupby(np.arange(n_rows) // seq_length)}
# Concatenate the blocks in a random order
yourdf = pd.concat([d.get(x) for x in np.random.choice(len(d), len(d), replace=False)])
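Since the shuffled frame is meant to be reshaped for LSTM training, here is a rough sketch of that follow-up step. The toy frame and the column names `f1` and `f2` are invented for illustration, and the `(n_sequences, seq_length, n_features)` layout is an assumption about what the downstream model expects.

```python
import numpy as np
import pandas as pd

seq_length = 4
# Toy data: 3 sequences of 4 rows, 2 feature columns (illustrative only)
df = pd.DataFrame({"f1": np.arange(12), "f2": np.arange(12) * 10})

# Block shuffle, following the approach in the solution above
n_rows = (len(df) // seq_length) * seq_length
trunc_data = df.head(n_rows)
d = {x: y for x, y in trunc_data.groupby(np.arange(n_rows) // seq_length)}
yourdf = pd.concat([d[x] for x in np.random.choice(len(d), len(d), replace=False)])

# Each consecutive run of seq_length rows becomes one sample
X = yourdf.to_numpy().reshape(-1, seq_length, yourdf.shape[1])
print(X.shape)  # (3, 4, 2)
```

Because the blocks were kept whole, every `X[i]` holds the rows of one original sequence in their original order.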

Upvotes: 3

Answers (3)

Samba Gangineni

Reputation: 173

Since your data comes in sequences of 4, the length of the DataFrame should be a multiple of 4. If your sequences have a different length, say 3, change 4 to 3 in the code.

>>> import pandas as pd
>>> import numpy as np

Creating the table:

>>> df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8],'col2':['a','b','c','d','e','f','g','h']})
>>> df
   col1 col2
0     1    a
1     2    b
2     3    c
3     4    d
4     5    e
5     6    f
6     7    g
7     8    h
>>> df.shape[0]
8

Creating the list for shuffling:

>>> np_range = np.arange(0,df.shape[0])
>>> np_range
array([0, 1, 2, 3, 4, 5, 6, 7])

Reshaping and shuffling:

>>> np_range1 = np.reshape(np_range,(df.shape[0]//4,4))
>>> np_range1
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
>>> np.random.shuffle(np_range1)
>>> np_range1
array([[4, 5, 6, 7],
       [0, 1, 2, 3]])
>>> np_range2 = np.reshape(np_range1,(df.shape[0],))
>>> np_range2
array([4, 5, 6, 7, 0, 1, 2, 3])

Selecting the data:

>>> new_df = df.loc[np_range2]
>>> new_df
   col1 col2
4     5    e
5     6    f
6     7    g
7     8    h
0     1    a
1     2    b
2     3    c
3     4    d

I hope this helps! Thank you!

Upvotes: 0

BENY

Reputation: 323276

Is this what you need? You can use np.random.choice:

d={x : y for x, y in df.groupby(np.arange(len(df))//4)}

yourdf=pd.concat([d.get(x) for x in np.random.choice(len(d),len(d),replace=False)])
yourdf
Out[986]: 
   col1 col2
4     5    e
5     6    f
6     7    g
7     8    h
0     1    a
1     2    b
2     3    c
3     4    d
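One property of the groupby approach worth noting: grouping on `np.arange(len(df)) // 4` does not require the length to be a multiple of 4, because leftover rows simply form a shorter final group. A minimal sketch with a made-up 10-row frame:

```python
import numpy as np
import pandas as pd

# 10 rows: two full groups of 4 and one leftover group of 2 (illustrative data)
df = pd.DataFrame({"col1": range(1, 11)})

group_ids = np.arange(len(df)) // 4
groups = [g for _, g in df.groupby(group_ids)]

# Put the groups back together in a random order
order = np.random.permutation(len(groups))
shuffled = pd.concat([groups[i] for i in order])

print([len(g) for g in groups])  # [4, 4, 2]
```

Whether the short group should be kept depends on the use case. For fixed-length sequences it is probably safer to drop it before shuffling.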

Upvotes: 1

P Maschhoff

Reputation: 186

You can reshuffle in groups of 4 by... grouping the index into groups of four and then shuffling them.

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(12, 2)))

new_index = np.array(df.index).reshape(-1, 4)  # one row of labels per group
np.random.shuffle(new_index)  # shuffles the groups in-place
df = df.loc[new_index.reshape(-1)]
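The snippet above relies on the length being a multiple of 4 and on `df.loc` finding each label exactly once. A possible variant that works on row positions instead is sketched below. The function name `shuffle_blocks` and its `seed` parameter are hypothetical, and dropping leftover rows is a choice made here, not something the answer prescribes.

```python
import numpy as np
import pandas as pd


def shuffle_blocks(df, seq_length=4, seed=None):
    """Shuffle blocks of seq_length consecutive rows.

    Rows inside a block keep their order; trailing rows that do not
    fill a complete block are dropped.
    """
    n_blocks = len(df) // seq_length
    # One row of positions per block
    positions = np.arange(n_blocks * seq_length).reshape(n_blocks, seq_length)
    rng = np.random.default_rng(seed)
    rng.shuffle(positions)  # reorders the blocks in place, along the first axis
    return df.iloc[positions.ravel()]


# Non-default index and a length that is not a multiple of 4
df = pd.DataFrame({"x": range(10)}, index=list("abcdefghij"))
out = shuffle_blocks(df, seq_length=4, seed=0)
print(len(out))  # 8
```

Using `iloc` means the index labels do not matter, and passing a `seed` makes the shuffle reproducible between runs.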

Upvotes: 1
