Reputation: 527
Apologies for a vague title; hopefully I can do a better job explaining. I have a pandas
DataFrame df
(sorted by c1
):
c1 c2
A A
A D
A E
B C
B D
C E
C G
D C
D F
E A
E D
E F
F C
F E
F G
G A
G B
G D
G G
How I want to sort df
is sorting c1
for a single value, then collect all corresponding c2
values for such c1
value and sort c1
by those values next. In other words,
c1 c2
A A
A D (*)
A E (**)
D (*) C (***)
D F (****)
E (**) A
E D
E F
C (***) E
C G (*****)
F (****) C
F E
F G
G (*****) A
G B (******)
G D
G G
B (******) C
B D
I used the *s for indicators. Here, you can see that c1
is sorted by elements in the order of appearance in c2
.
I first tried sorting both columns together,
df = df.sort_values(by = ["c1", "c2"])
or reindexing after sorting them differently,
df.reindex(pd.DataFrame(np.sort(df.values,1)).sort_values([0,1]).index)
and also have tried zipping them,
df['c1', 'c2'] = [list(x) for x in zip(*sorted(zip(df["c1"], df["c2"]), key=lambda pair: pair[0]))]
but none of the methods seem to deliver what I want. They simply deliver the identical-looking DataFrame to how df
is initially sorted. Any insights on this custom sorting?
Upvotes: 2
Views: 105
Reputation: 862591
You can use ordered categoricals by unique values of df2['c2']
column, so after DataFrame.sort_values
by c1
is sorting like you need :
df['c1'] = pd.CategoricalIndex(df['c1'], ordered=True, categories=df['c2'].unique())
df = df.sort_values(by = "c1")
print (df)
0 A A
1 A D
2 A E
8 D F
7 D C
11 E F
10 E D
9 E A
5 C E
6 C G
17 G D
15 G A
16 G B
18 G G
12 F C
13 F E
14 F G
4 B D
3 B C
If possible some values from c1
not exist in c2
(like F1
values in changed sample data) then solution is a bit different:
print (df)
c1 c2
0 A A
1 A D
2 A E
3 B C
4 B D
5 C E
6 C G
7 D C
8 D F
9 E A
10 E D
11 E F
12 F C
13 F1 E
14 F1 G
15 G A
16 G B
17 G D
18 G G
union = df[['c2','c1']].unstack().drop_duplicates()
print (union)
c2 0 A
1 D
2 E
3 C
6 G
8 F
16 B
c1 13 F1
dtype: object
df['c1'] = pd.CategoricalIndex(df['c1'], ordered=True, categories=union)
df = df.sort_values(by = "c1")
print (df)
c1 c2
0 A A
1 A D
2 A E
8 D F
7 D C
11 E F
10 E D
9 E A
5 C E
6 C G
17 G D
15 G A
16 G B
18 G G
12 F C
4 B D
3 B C
13 F1 E
14 F1 G
Upvotes: 3