sdbbs
sdbbs

Reputation: 5384

Sort by one column, then group by another, in Pandas Dataframe?

This is a sort of an opposite question from ones I could find with similar wording, such as:

Say, I have this DataFrame:

import pandas as pd

df = pd.DataFrame({
  'model': ['Punto', 'Doblo', 'Panda', 'Doblo','Punto', 'Tipo'] ,
  'timestamp': ['20200124_083155', '20200124_122052', '20200124_134350', '20200124_150801', '20200124_163540', '20200124_195955']
})
print(df)

This prints out:

   model        timestamp
0  Punto  20200124_083155
1  Doblo  20200124_122052
2  Panda  20200124_134350
3  Doblo  20200124_150801
4  Punto  20200124_163540
5   Tipo  20200124_195955

What I would like to obtain is: first sort by timestamp; then in order of appearance in that sort, group by order of appearance - but without the extra "group" columns that the pandas .groupby clause would add; that is, I would like to obtain the final output:

   model        timestamp
0  Punto  20200124_083155
1  Punto  20200124_163540
2  Doblo  20200124_122052
3  Doblo  20200124_150801
4  Panda  20200124_134350
5   Tipo  20200124_195955

How can I achieve that?

Upvotes: 1

Views: 1530

Answers (1)

jezrael
jezrael

Reputation: 862511

I think this is possible by ordered categoricals, set order by sorted timestamp values in first step and then sorting by both columns by DataFrame.sort_values:

c = df.sort_values('timestamp')['model'].unique()

df['model'] = pd.Categorical(df['model'], ordered=True, categories=c)

df = df.sort_values(['model','timestamp'])
print (df)
   model        timestamp
0  Punto  20200124_083155
4  Punto  20200124_163540
1  Doblo  20200124_122052
3  Doblo  20200124_150801
2  Panda  20200124_134350
5   Tipo  20200124_195955

Upvotes: 2

Related Questions