raven
raven

Reputation: 527

How to custom sort two pandas dataframe column in Python?

Apologies for a vague title; hopefully I can do a better job explaining. I have a pandas DataFrame df (sorted by c1):

c1       c2
A        A
A        D
A        E
B        C
B        D
C        E
C        G
D        C
D        F
E        A
E        D
E        F
F        C
F        E
F        G
G        A
G        B
G        D
G        G

How I want to sort df is sorting c1 for a single value, then collect all corresponding c2 values for such c1 value and sort c1 by those values next. In other words,

c1          c2
A           A
A           D (*)
A           E (**)
D (*)       C (***)
D           F (****)
E (**)      A
E           D
E           F
C (***)     E
C           G (*****)
F (****)    C
F           E
F           G
G (*****)   A
G           B (******)
G           D
G           G
B (******)  C
B           D

I used the *s for indicators. Here, you can see that c1 is sorted by elements in the order of appearance in c2.

I first tried sorting both columns together,

df = df.sort_values(by = ["c1", "c2"])

or reindexing after sorting them differently,

df.reindex(pd.DataFrame(np.sort(df.values,1)).sort_values([0,1]).index)

and also have tried zipping them,

df['c1', 'c2'] = [list(x) for x in zip(*sorted(zip(df["c1"], df["c2"]), key=lambda pair: pair[0]))]

but none of the methods seem to deliver what I want. They simply deliver the identical-looking DataFrame to how df is initially sorted. Any insights on this custom sorting?

Upvotes: 2

Views: 105

Answers (1)

jezrael
jezrael

Reputation: 862591

You can use ordered categoricals by unique values of df2['c2'] column, so after DataFrame.sort_values by c1 is sorting like you need :

df['c1'] = pd.CategoricalIndex(df['c1'], ordered=True, categories=df['c2'].unique())
df = df.sort_values(by = "c1")
print (df)
0   A  A
1   A  D
2   A  E
8   D  F
7   D  C
11  E  F
10  E  D
9   E  A
5   C  E
6   C  G
17  G  D
15  G  A
16  G  B
18  G  G
12  F  C
13  F  E
14  F  G
4   B  D
3   B  C

If possible some values from c1 not exist in c2 (like F1 values in changed sample data) then solution is a bit different:

print (df)
    c1 c2
0    A  A
1    A  D
2    A  E
3    B  C
4    B  D
5    C  E
6    C  G
7    D  C
8    D  F
9    E  A
10   E  D
11   E  F
12   F  C
13  F1  E
14  F1  G
15   G  A
16   G  B
17   G  D
18   G  G

union = df[['c2','c1']].unstack().drop_duplicates()
print (union)
c2  0      A
    1      D
    2      E
    3      C
    6      G
    8      F
    16     B
c1  13    F1
dtype: object

df['c1'] = pd.CategoricalIndex(df['c1'], ordered=True, categories=union)
df = df.sort_values(by = "c1")
print (df)
    c1 c2
0    A  A
1    A  D
2    A  E
8    D  F
7    D  C
11   E  F
10   E  D
9    E  A
5    C  E
6    C  G
17   G  D
15   G  A
16   G  B
18   G  G
12   F  C
4    B  D
3    B  C
13  F1  E
14  F1  G

Upvotes: 3

Related Questions