Reputation: 161
I want to shuffle the rows of a data frame within each block of n rows (the window size), but I am not sure how to do it in a pythonic way. I found answers for shuffling all rows, but not for a given window size:
def permute(df: pd.DataFrame, window_size: int = 10) -> pd.DataFrame:
    df_permuted = df.copy()
    """How would you shuffle every window_size rows for the modifiable columns?"""
    df_permuted.loc[:, modifiable_columns]
    ...
    return df_permuted
Upvotes: 1
Views: 210
Reputation: 168
To cover the requirement that appears in your code's comment but not in your question, here's a version that also takes the modifiable columns into account.
In the example below, mod and mod2 are your modifiable columns, while the nomod column is not modifiable.
I believe restricting the shuffle to the modifiable columns cannot be done with a vectorized approach, so this adds to the accepted answer. Also, the accepted answer keeps another full copy of the entire df in memory, while my version only keeps a copy as large as window_size.
import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(0, 12)]*3).T
df.columns = ['mod', 'nomod', 'mod2']
df
    mod  nomod  mod2
0     0      0     0
1     1      1     1
2     2      2     2
3     3      3     3
4     4      4     4
5     5      5     5
6     6      6     6
7     7      7     7
8     8      8     8
9     9      9     9
10   10     10    10
11   11     11    11
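def permute(df, window_size, modifiable_columns):
    # Assumes the default RangeIndex; rows past the last full window stay in place.
    num_chunks = len(df) // window_size
    for i in range(num_chunks):
        start_ind = i * window_size
        end_ind = start_ind + window_size
        # .loc slicing is inclusive, hence end_ind - 1. The fixed random_state
        # makes every window receive the same permutation.
        df_row_subset = df.loc[start_ind:end_ind-1, modifiable_columns].sample(frac=1, random_state=1)
        # Re-label the shuffled rows so the assignment aligns on the window's positions.
        df_row_subset.index = np.arange(start_ind, end_ind)
        df.loc[df_row_subset.index, modifiable_columns] = df_row_subset
    return df  # note: df is modified in place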
permute(df, 4, ['mod', 'mod2'])
    mod  nomod  mod2
0     3      0     3
1     2      1     2
2     0      2     0
3     1      3     1
4     7      4     7
5     6      5     6
6     4      6     4
7     5      7     5
8    11      8    11
9    10      9    10
10    8     10     8
11    9     11     9
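Because random_state=1 is reused for every window, each window above is permuted identically (3, 2, 0, 1), and the function relies on the default integer index. A minimal sketch of a variant that draws a fresh permutation per window and works by position rather than by label could look like the following. The name permute_windows and the seed parameter are hypothetical, introduced only for illustration; this is one possible approach under those assumptions, not a definitive implementation.

```python
import numpy as np
import pandas as pd


def permute_windows(df, window_size, modifiable_columns, seed=None):
    """Shuffle `modifiable_columns` within consecutive windows of rows.

    Hypothetical helper: the name and the `seed` argument are illustrative.
    Returns a new DataFrame; the input is left untouched.
    """
    rng = np.random.default_rng(seed)
    # Build one positional ordering for the whole frame, window by window.
    order = np.arange(len(df))
    for start in range(0, len(df), window_size):
        # NumPy clips the slice, so a shorter last window is shuffled too.
        order[start:start + window_size] = rng.permutation(order[start:start + window_size])
    out = df.copy()
    for col in modifiable_columns:
        # Reorder the raw values; the index and other columns are unchanged.
        out[col] = df[col].to_numpy()[order]
    return out


df = pd.DataFrame({"mod": np.arange(12), "nomod": np.arange(12), "mod2": np.arange(12)})
out = permute_windows(df, 4, ["mod", "mod2"], seed=0)
print(out)
```

The same ordering is applied to every modifiable column, so values that started in the same row stay together, as in the output above. Unlike the loop above, a trailing partial window is shuffled as well.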
Upvotes: 0
Reputation: 150765
The accepted answer is not vectorized. Using groupby.sample is a better choice (N is the window size):
df.groupby(np.arange(len(df))//N).sample(frac=1)
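For reference, here is a self-contained sketch of that one-liner. Note that groupby.sample shuffles whole rows, index labels included, so the sketch samples only the modifiable columns and writes their values back by position. The value N = 4, the example frame, and the modifiable_columns list are assumptions made for illustration; groupby.sample requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

N = 4  # window size (assumed value for this example)
df = pd.DataFrame({"mod": np.arange(12), "nomod": np.arange(12), "mod2": np.arange(12)})
modifiable_columns = ["mod", "mod2"]  # assumed list of columns to shuffle

# Label each row with its window number: 0,0,0,0,1,1,1,1,...
groups = np.arange(len(df)) // N

# Shuffle the rows inside each window; the result comes back window by window.
shuffled = df[modifiable_columns].groupby(groups).sample(frac=1, random_state=0)

# Write the shuffled values back by position so the index and the
# non-modifiable columns keep their original order.
result = df.copy()
result[modifiable_columns] = shuffled.to_numpy()
print(result)
```

If every column may be shuffled, the positional write-back is unnecessary and df.groupby(groups).sample(frac=1).reset_index(drop=True) should be enough.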
Upvotes: 1