Heisenbug
Heisenbug

Reputation: 126

Sorting rows in pandas first by timestamp values and then by giving particular order to categorical values of a column

I have a pandas dataframe which has a column "user" containing categorical values(a,b,c,d). I only care about the ordering of two categories in ascending order (a, d). So (a,b,c,d) and (a,c,b,d) both are fine for me.

How to create the ordering is the first part of the question?

Secondly I have another column which contains "timestamps". I want to order my rows first by "timestamps" and then for the rows with same timestamps I want to sort with the above ordering of categorical values.

Lets say my data frame looks like this.

+-----------+------+
| Timestamp | User |
+-----------+------+
|         1 | b    |
|         2 | d    |
|         1 | a    |
|         1 | c    |
|         1 | d    |
|         2 | a    |
|         2 | b    |
+-----------+------+

I want first this kind of sorting to happen

+-----------+------+
| Timestamp | User |
+-----------+------+
|         1 | b    |
|         1 | a    |
|         1 | c    |
|         1 | d    |
|         2 | d    |
|         2 | a    |
|         2 | b    |
+-----------+------+

Followed by the categorical ordering of "user"

+-----------+------+
| Timestamp | User |
+-----------+------+
|         1 | a    |
|         1 | b    |
|         1 | c    |
|         1 | d    |
|         2 | a    |
|         2 | b    |
|         2 | d    |
+-----------+------+

OR


+-----------+------+
| Timestamp | User |
+-----------+------+
|         1 | a    |
|         1 | c    |
|         1 | b    |
|         1 | d    |
|         2 | a    |
|         2 | b    |
|         2 | d    |
+-----------+------+

As you can see the "c" and "b"'s order do not matter.

Upvotes: 1

Views: 524

Answers (1)

jezrael
jezrael

Reputation: 863166

You can specify order in ordered categorical by categories and then call DataFrame.sort_values:

df['User'] = pd.Categorical(df['User'], ordered=True, categories=['a','b','c','d'])

df = df.sort_values(['Timestamp','User'])
print (df)
   Timestamp User
2          1    a
0          1    b
3          1    c
4          1    d
5          2    a
6          2    b
1          2    d

If there is many values of User is possible dynamically create categories:

vals = ['a', 'd']
cats = vals + np.setdiff1d(df['User'], vals).tolist() 
print (cats)
['a', 'd', 'b', 'c']

df['User'] = pd.Categorical(df['User'], ordered=True, categories=cats)
df = df.sort_values(['Timestamp','User'])
print (df)
   Timestamp User
2          1    a
4          1    d
0          1    b
3          1    c
5          2    a
1          2    d
6          2    b

Upvotes: 1

Related Questions