Pandas groupby: different indexes returned

Question

Versions:
* Pandas 0.21.0
* Python 2.7.12

I see this strange behaviour in Pandas: according to the dataframe and the sort, groupby returns different results:

import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
                   'val1': [11, None],
                   'val2': [None, 31]
                  })

df2 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
                   'val1': [11, None],
                   'val2': [31, None]
                  })



df1b = df1.melt("dtime").dropna().set_index("dtime")
df2b = df2.melt("dtime").dropna().set_index("dtime")

r1a = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r1b = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))

r2a = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r2b = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))

print "
--- df1 ascending sort ---------------
", r1a
print "
--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
", r1b
print "
--- df2 ascending sort ---------------
", r2a
print "
--- df2 descending sort --------------
", r2b

return

--- df1 ascending sort ---------------
dtime
2017-01-01 01:05:00    11.0
2017-01-01 01:20:00    31.0
Name: value, dtype: float64

--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
dtime                dtime              
2017-01-01 01:00:00  2017-01-01 01:20:00    31.0
                     2017-01-01 01:05:00    11.0
Name: value, dtype: float64

--- df2 ascending sort ---------------
dtime
2017-01-01 01:05:00    11.0
2017-01-01 01:05:00    31.0
Name: value, dtype: float64

--- df2 descending sort --------------
dtime
2017-01-01 01:05:00    31.0
2017-01-01 01:05:00    11.0
Name: value, dtype: float64

Why df1 descending sort returns the reference time for grouping (2017-01-01 01:00:00), while it is missing in the other groups? I would expect that all grouping return a reference time as multi-index, without exception.

I created an issue at github pandas #18908 but in a month had no useful answer. I would like to understand if it is just a presentation problem or a possible bug. Am I missing something?

Phil Sheard · Accepted Answer

Tested on Python 3.5 and Pandas 0.22 - using slightly different code (pd.melt() instead of df.melt).

Firstly - the reason for the difference.

df1 has values for two different timestamps (01:20:00 and 01:05:00), where df2 both observations that remain are on the same timestamp (01:05:00).

I'm not exactly sure why, but evidently the act of reversing the sort means does mean that the .values call is including the grouper time window on df1 because you're sorting values as descending within a ascending timeindex. The result is different for df2 because the timestamps are identical and so the grouper isn't required. You possibly already deduced that, so apologies for duplication if so.

Suggestion to consistently show your expected result

Normally when working with pandas.DataFrame.groupby I would use an aggregation call to manipulate the data (max in the example below but sum, count and others are available). If you're looking to see your values with the reference time of the grouper, this would be the way to approach it:

Here's df1b, with multiple timestamps against the observation values:

In []: df1b.groupby([pd.Grouper(freq="1h"), df1b.index, "variable"])\
       .max().sort_values("value", ascending=False)
Out[]:
                                                  value
dtime               dtime               variable
2017-01-01 01:00:00 2017-01-01 01:20:00 val2       31.0
                    2017-01-01 01:05:00 val1       11.0

...and same approach for df2b:

In []: df2b.groupby([pd.Grouper(freq="1h"), df2b.index, "variable"])\
       .max().sort_values("value", ascending=False)
Out[]:
                                                  value
dtime               dtime               variable
2017-01-01 01:00:00 2017-01-01 01:05:00 val2       31.0
                                        val1       11.0

Does this help? Or have I missed the significance of why you were working with the .values method?

Disclaimer: I'm not familiar with working with groupby.value so you might be trying to achieve something I've missed.

Pandas groupby: different indexes returned

Answers (2)

Related Questions