Reputation: 2749
I want to change the day of order presented by below code.
What I want is a result with the order (Mon, Tue, Wed, Thu, Fri, Sat, Sun)
- should I say, sort by key in certain predefined order?
Here is my code which needs some tweak:
f8 = df_toy_indoor2.groupby(['device_id', 'day'])['dwell_time'].sum()
print(f8)
Current result:
device_id day
device_112 Thu 436518
Wed 636451
Fri 770307
Tue 792066
Mon 826862
Sat 953503
Sun 1019298
device_223 Mon 2534895
Thu 2857429
Tue 3303173
Fri 3548178
Wed 3822616
Sun 4213633
Sat 4475221
Desired result:
device_id day
device_112 Mon 826862
Tue 792066
Wed 636451
Thu 436518
Fri 770307
Sat 953503
Sun 1019298
device_223 Mon 2534895
Tue 3303173
Wed 3822616
Thu 2857429
Fri 3548178
Sat 4475221
Sun 4213633
Here, type(df_toy_indoor2.groupby(['device_id', 'day'])['dwell_time'])
is a class 'pandas.core.groupby.SeriesGroupBy'.
I have found .sort_values()
, but it is a built-in sort function by values.
I want to get some pointers to set some order to use it further data manipulation.
Thanks in advance.
Upvotes: 18
Views: 15075
Reputation: 33783
Set the 'day'
column as categorical dtype, just make sure when you set the category your list of days is sorted as you'd like it to be. Performing the groupby
will then automatically sort it for you, but if you otherwise tried to sort the column it will sort in the correct order that you specify.
# Initial setup.
np.random.seed([3,1415])
n = 100
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
df = pd.DataFrame({
'device_id': np.random.randint(1,3,n),
'day': np.random.choice(days, n),
'dwell_time':np.random.random(n)
})
# Set as category, groupby, and sort.
df['day'] = df['day'].astype("category", categories=days, ordered=True)
df = df.groupby(['device_id', 'day']).sum()
Update: astype no longer accepts categories, use:
category_day = pd.api.types.CategoricalDtype(categories=days, ordered=True)
df['day'] = df['day'].astype(category_day)
The resulting output:
dwell_time
device_id day
1 Mon 4.428626
Tue 3.259319
Wed 2.436024
Thu 0.909724
Fri 4.974137
Sat 5.583778
Sun 2.687258
2 Mon 3.117923
Tue 2.427154
Wed 1.943927
Thu 4.599547
Fri 2.628887
Sat 6.247520
Sun 2.716886
Note that this method works for any type of customized sorting. For example, if you had a column with entries 'a', 'b', 'c'
, and wanted it to be sorted in a non-standard order, e.g. 'c', 'a', 'b'
, you'd just do the same type of procedure: specify the column as categorical with your categories being in the non-standard order you want.
Upvotes: 5
Reputation: 1050
If you sort the dataframe prior to the groupby
, pandas will maintain the order of your sort. First thing you'll have to do is come up with a good way to sort the days of the week. One way of doing that is to assign an int representing the day of the week to each row, then sort on that column. For example:
import pandas
df = pandas.DataFrame(
columns=['device_id', 'day', 'dwell_time'],
data=[[1, 'Wed', 35], [1, 'Mon', 63], [2, 'Sat', 83], [2, 'Fri', 82]]
)
df['day_of_week'] = df.apply(
lambda x: ['Mon', 'Tues', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'].index(x.day),
1
)
print(df.sort(['device_id', 'day_of_week']).groupby(['device_id', 'day'])['dwell_time'].sum())
yields:
device_id day dwell_time
1 Mon 63
Wed 35
2 Fri 82
Sat 83
Upvotes: 1
Reputation: 3677
Took me some time, but I found the solution. reindex does what you want. See my code example:
a = [1, 2] * 2 + [2, 1] * 3 + [1, 2]
b = ['Mon', 'Wed', 'Thu', 'Fri'] * 3
c = list(range(12))
df = pd.DataFrame(data=[a,b,c]).T
df.columns = ['device', 'day', 'value']
df = df.groupby(['device', 'day']).sum()
gives:
value
device day
1 Fri 7
Mon 0
Thu 12
Wed 14
2 Fri 14
Mon 12
Thu 6
Wed 1
Then doing reindex:
df.reindex(['Mon', 'Wed', 'Thu', 'Fri'], level='day')
or more conveniently (credits to burhan)
df.reindex(list(calendar.day_abbr), level='day')
gives:
value
device day
1 Mon 0
Wed 14
Thu 12
Fri 7
2 Mon 12
Wed 1
Thu 6
Fri 14
Upvotes: 26
Reputation:
Probably not the best way, but as far as I know you cannot pass a function/mapping to sort_values
. As a workaround, I generally use assign
to add a new column and sort by that column. In your example, that also requires resetting the index first (and setting it back).
days = {'Mon': 1, 'Tue': 2, 'Wed': 3, 'Thu': 4, 'Fri': 5, 'Sun': 6, 'Sat': 7}
f8 = f8.reset_index()
(f8.assign(day_num=f8['day'].map(days))
.sort_values(['device_id', 'day_num'])
.set_index(['device_id', 'day'])
.drop('day_num', axis=1))
Out:
0
device_id day
0d4fd55bb363bf6f6f7f8b3342cd0467 Mon 826862
Tue 792066
Wed 636451
Thu 436518
Fri 770307
Sun 1019298
Sat 953503
f6258edf9145d1c0404e6f3d7a27a29d Mon 2534895
Tue 3303173
Wed 3822616
Thu 2857429
Fri 3548178
Sun 4213633
Sat 4475221
Upvotes: 1