gongzhitaao

Reputation: 6682

Permute groups in Pandas

Say I have a pandas DataFrame whose data looks like this:

import numpy as np
import pandas as pd

n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})

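Question: how can I permute the groups (grouped by column b)?

I don't mean a permutation within each group, but a permutation at the group level: the rows of each group stay together and keep their order, and only the order of the groups changes.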


Example

Before

a b c
1 0 1
2 0 2
3 1 3
4 1 4
5 2 5
6 2 6

After

a b c
3 1 3
4 1 4
1 0 1
2 0 2
5 2 5
6 2 6

Basically, before the permutation df['b'].unique() gives [0, 1, 2]; after the permutation it gives [1, 0, 2].

Upvotes: 1

Views: 402

Answers (2)

sparc_spread

Reputation: 10843

Here's an answer inspired by the accepted answer to this SO post, which uses a temporary Categorical column as a sorting key to do custom sort orderings. In this answer, I produce all permutations, but you can just take the first one if you are looking for only one.

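import itertools

import pandas as pd

df_results = []
# every possible ordering of the distinct values in column b
orderings = itertools.permutations(df["b"].unique())
for ordering in orderings:
    df_2 = df.copy()
    # temporary sort key: categories listed in the desired group order
    df_2["b_key"] = pd.Categorical(df_2["b"], list(ordering))
    # mergesort is stable, so rows keep their order within each group
    df_2.sort_values("b_key", kind="mergesort", inplace=True)
    df_2.drop(["b_key"], axis=1, inplace=True)
    df_results.append(df_2)

# use a separate loop variable so the original df is not overwritten
for df_perm in df_results:
    print(df_perm)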

The idea here is that we create a new categorical column each time, with the categories listed in a different order, then sort by it. We drop the column once the sort is done, since we no longer need it.
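If only one random group-level permutation is needed, the same idea can be applied once instead of enumerating every ordering. The following is a minimal sketch, not part of the original answer: the function name `permute_groups` and its `seed` parameter are hypothetical, and it sorts the categorical codes with a stable NumPy argsort so that rows keep their original order within each group.

```python
import numpy as np
import pandas as pd


def permute_groups(df, col, seed=None):
    """Return df with the groups of `col` in a random order (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Random ordering of the distinct group labels.
    order = rng.permutation(df[col].unique())
    # Categorical codes give each row the position of its group in `order`.
    key = pd.Categorical(df[col], categories=order, ordered=True)
    # A stable argsort keeps the original row order inside each group.
    positions = np.argsort(key.codes, kind="stable")
    return df.iloc[positions]


df_demo = pd.DataFrame({"a": np.arange(12),
                        "b": [0, 1, 2, 0, 1, 2, 2, 1, 0, 0, 1, 2],
                        "c": np.arange(12)})
shuffled = permute_groups(df_demo, "b", seed=0)
print(shuffled)
```

Because the ordering is drawn at random, it can coincide with the original group order; passing a different `seed` gives a different draw.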

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210882

If I understood your question correctly, you can do it this way:

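import numpy as np
import pandas as pd

n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})

# order[v] is the position that group v takes in the result:
# group 0 -> position 1, group 1 -> position 0, group 2 -> position 2
order = pd.Series([1, 0, 2])

# remember the original columns before adding the helper column
cols = df.columns

df['idx'] = df.b.map(order)

# sort by group position, then by the original index to keep within-group order
df = df.reset_index().sort_values(['idx', 'index'])[cols]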

Step by step:

In [103]: df['idx'] = df.b.map(order)

In [104]: df
Out[104]:
     a  b   c  idx
0    0  2   0    2
1    1  0   1    1
2    2  1   2    0
3    3  0   3    1
4    4  1   4    0
5    5  1   5    0
6    6  1   6    0
7    7  2   7    2
8    8  0   8    1
9    9  1   9    0
10  10  0  10    1
11  11  1  11    0
12  12  0  12    1
13  13  2  13    2
14  14  0  14    1
15  15  2  15    2
16  16  1  16    0
17  17  2  17    2
18  18  1  18    0
19  19  1  19    0
20  20  0  20    1
21  21  0  21    1
22  22  1  22    0
23  23  1  23    0
24  24  2  24    2
25  25  0  25    1
26  26  0  26    1
27  27  0  27    1
28  28  1  28    0
29  29  1  29    0

In [105]: df.reset_index().sort_values(['idx', 'index'])
Out[105]:
    index   a  b   c  idx
2       2   2  1   2    0
4       4   4  1   4    0
5       5   5  1   5    0
6       6   6  1   6    0
9       9   9  1   9    0
11     11  11  1  11    0
16     16  16  1  16    0
18     18  18  1  18    0
19     19  19  1  19    0
22     22  22  1  22    0
23     23  23  1  23    0
28     28  28  1  28    0
29     29  29  1  29    0
1       1   1  0   1    1
3       3   3  0   3    1
8       8   8  0   8    1
10     10  10  0  10    1
12     12  12  0  12    1
14     14  14  0  14    1
20     20  20  0  20    1
21     21  21  0  21    1
25     25  25  0  25    1
26     26  26  0  26    1
27     27  27  0  27    1
0       0   0  2   0    2
7       7   7  2   7    2
13     13  13  2  13    2
15     15  15  2  15    2
17     17  17  2  17    2
24     24  24  2  24    2
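Note that the `order` Series above is a lookup from group value to target position, which is not the same thing as the list of groups in the desired order; for [1, 0, 2] the two happen to coincide. A minimal sketch of building that lookup from an explicit group order follows. The helper name `reorder_groups` is hypothetical, and the sketch assumes pandas 1.1 or later, where `sort_values` accepts a `key` argument.

```python
import pandas as pd


def reorder_groups(df, col, order):
    """Sort df so the groups of `col` appear in the sequence given by `order`."""
    # Invert the desired order into a value -> position lookup.
    rank = {value: pos for pos, value in enumerate(order)}
    # mergesort is stable, so rows keep their original order within each group.
    return df.sort_values(by=col, key=lambda s: s.map(rank), kind="mergesort")


df_small = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                         "b": [0, 1, 2, 0, 1, 2]})
result = reorder_groups(df_small, "b", [2, 0, 1])
print(result["b"].unique())  # [2 0 1]
```

Here the desired order [2, 0, 1] becomes the lookup {2: 0, 0: 1, 1: 2}, which differs from the order list itself, so the inversion step matters.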

Upvotes: 1
