user3768495
user3768495

Reputation: 4637

pandas groupby then aggregate results order not repeatable?

Here's my codes:

import pandas as pd
df = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'b'],
                   'v1': [1, 2, 3 ,4, 5],
                   'v2': [7, 6, 5, 4, 3],
                   'v3': [2, 4, 6, 8, 10]})

df.groupby('id').agg({'v1': ['count', 'sum'],
                      'v2': 'mean',
                      'v3': 'sum'})

My issue is that every time I run this code (in jupyter notebook), the resulted dataframe have different column orders. Is this a bug in pandas?

My next step is to rename the resulted dataframe, but with no reproducible order, it is kinda impossible to write a reusable code to do that. How could I work around it?

Btw, I am using python 3.5 and pandas 0.23.0.

Thanks!

Upvotes: 0

Views: 210

Answers (2)

jeschwar
jeschwar

Reputation: 1314

dict keys are not ordered in Python 3.5; this applies to the dicts used in pd.DataFrame() and .agg() in your code. You could used an OrderedDict from the collections module in the Python standard library or you could use DataFrame.reindex() to re-organize your columns and rows as follows:

import pandas as pd
df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'b', 'b'],
    'v1': [1, 2, 3 ,4, 5],
    'v2': [7, 6, 5, 4, 3],
    'v3': [2, 4, 6, 8, 10]
})

df = df.set_index('id')
df = df.reindex(columns=['v1', 'v2', 'v3'])

df.groupby(level='id').agg({
    'v1': ['count', 'sum'],
    'v2': 'mean',
    'v3': 'sum'
})

DataFrame.reindex() is more general than DataFrame.sort_index().

Upvotes: 0

sacuL
sacuL

Reputation: 51335

I don't get different ordering when I run the code multiple times. However, if you're running into this problem, you could just name the order you want after your agg. For instance, if you wanted the order v2, v3, v1, do:

df.groupby('id',sort=False).agg({'v1': ['count', 'sum'],
                      'v2': 'mean',
                      'v3': 'sum'})[['v2','v3','v1']]

     v2  v3    v1    
   mean sum count sum
id                   
a   6.5   6     2   3
b   4.0  24     3  12

Or if you simply want the same ordering as your original dataframe:

df.groupby('id',as_index=False).agg({'v1': ['count', 'sum'],
                                     'v2': 'mean',
                                     'v3': 'sum'})[df.columns]

  id    v1       v2  v3
     count sum mean sum
0  a     2   3  6.5   6
1  b     3  12  4.0  24

but all in all, @Allolz's comment makes the most sense IMO, using sort_index(1):

df.groupby('id').agg({'v1': ['count', 'sum'],
                      'v2': 'mean',
                      'v3': 'sum'}).sort_index(1)

Upvotes: 3

Related Questions