Invader

Reputation: 161

What is the best way to shuffle/permute every n rows of a data frame in Python?

I want to shuffle the rows of a data frame within each consecutive block of n rows (the window size), but I am not sure how to do it in a Pythonic way. I found answers for shuffling all rows, but not for shuffling within a given window size:

def permute(df: pd.DataFrame, window_size: int = 10) -> pd.DataFrame:
    df_permuted = df.copy()
    # How would you shuffle every window_size rows for the modifiable columns?
    # (modifiable_columns is a list of column names defined elsewhere.)
    df_permuted.loc[:, modifiable_columns]
    ...
    return df_permuted

Upvotes: 1

Views: 210

Answers (2)

npetrov937

Reputation: 168

Your code's comment mentions a requirement that your question text does not: only the modifiable columns should be shuffled. Here's a version that takes that into account.

In the example below, mod and mod2 are your modifiable columns, while the nomod column is not modifiable.

I believe that restricting the shuffle to the modifiable columns cannot be achieved with a vectorized approach, so this adds to the accepted answer. Also, the accepted answer keeps another full copy of the entire df in memory, while my version only holds a temporary subset as large as window_size.

import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(0, 12)]*3).T
df.columns = ['mod', 'nomod', 'mod2']
df

    mod  nomod  mod2
0     0      0     0
1     1      1     1
2     2      2     2
3     3      3     3
4     4      4     4
5     5      5     5
6     6      6     6
7     7      7     7
8     8      8     8
9     9      9     9
10   10     10    10
11   11     11    11

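def permute(df, window_size, modifiable_columns):
    # Shuffles the modifiable columns within each window of window_size rows.
    # Note: modifies df in place and assumes the default 0..len(df)-1 index.
    # Stepping by window_size also covers a shorter trailing window.
    for start_ind in range(0, len(df), window_size):
        end_ind = min(start_ind + window_size, len(df))

        # .loc slices are inclusive, hence end_ind - 1.
        # random_state=1 makes the result reproducible, but it also means
        # every full window receives the same permutation.
        df_row_subset = df.loc[start_ind:end_ind - 1, modifiable_columns].sample(frac=1, random_state=1)
        # Relabel the shuffled rows so the assignment below aligns them by position.
        df_row_subset.index = np.arange(start_ind, end_ind)

        df.loc[df_row_subset.index, modifiable_columns] = df_row_subset

    return df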

permute(df, 4, ['mod', 'mod2'])

    mod  nomod  mod2
0     3      0     3
1     2      1     2
2     0      2     0
3     1      3     1
4     7      4     7
5     6      5     6
6     4      6     4
7     5      7     5
8    11      8    11
9    10      9    10
10    8     10     8
11    9     11     9
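
In the output above, every window ends up in the same relative order, because the same random_state is reused for each window. As a rough sketch only (not the code this answer was written with), the variant below draws an independent permutation per window from a single NumPy generator, works on a copy, and addresses rows by position so it does not depend on the index labels. The names permute_independent and seed are made up for this illustration.

```python
import numpy as np
import pandas as pd


def permute_independent(df, window_size, modifiable_columns, seed=None):
    """Shuffle modifiable_columns within each block of window_size rows.

    Hypothetical helper for illustration: each window gets its own
    permutation, and a shorter trailing window is shuffled as well.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()  # leave the caller's frame untouched
    # Integer positions of the columns that are allowed to change.
    col_pos = out.columns.get_indexer(modifiable_columns)
    for start in range(0, len(out), window_size):
        end = min(start + window_size, len(out))
        # Random ordering of the row positions inside this window.
        order = start + rng.permutation(end - start)
        # to_numpy() drops the labels, so values are written by position.
        out.iloc[start:end, col_pos] = df.iloc[order, col_pos].to_numpy()
    return out


df = pd.DataFrame([np.arange(0, 12)] * 3).T
df.columns = ['mod', 'nomod', 'mod2']
result = permute_independent(df, 5, ['mod', 'mod2'], seed=0)
print(result)
```

With window_size=5 on 12 rows, the last window holds only two rows and is still shuffled. The modifiable columns move together, so mod and mod2 stay aligned row by row, while nomod keeps its original order.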

Upvotes: 0

Quang Hoang

Reputation: 150765

The accepted answer is not vectorized. Using groupby.sample is a better choice:

# N is the window size
df.groupby(np.arange(len(df))//N).sample(frac=1)
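
To make the one-liner concrete, here is a small self-contained sketch. The sample frame, the value N = 4, the column names, and random_state=0 are all assumptions chosen for illustration. The last two lines show one possible way to restrict the shuffle to a subset of columns by writing the shuffled values back by position. That step is an extension, not part of the one-liner itself.

```python
import numpy as np
import pandas as pd

N = 4  # window size (assumed value for this example)
modifiable_columns = ['mod', 'mod2']  # assumed column names

df = pd.DataFrame([np.arange(0, 12)] * 3).T
df.columns = ['mod', 'nomod', 'mod2']

# Rows 0..N-1 get group 0, rows N..2N-1 get group 1, and so on.
window_id = np.arange(len(df)) // N

# Shuffle whole rows inside each window. The windows stay in order,
# and the original index labels travel with their rows.
shuffled = df.groupby(window_id).sample(frac=1, random_state=0)

# Optional: move only the modifiable columns. to_numpy() drops the
# index, so the values are assigned by position rather than realigned.
result = df.copy()
result[modifiable_columns] = shuffled[modifiable_columns].to_numpy()
print(result)
```

If whole rows should be shuffled, shuffled.reset_index(drop=True) gives a frame with a fresh 0..len(df)-1 index.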

Upvotes: 1
