Alex Poca
Alex Poca

Reputation: 2566

Pandas groupby: different indexes returned

Versions:
* Pandas 0.21.0
* Python 2.7.12

I see this strange behaviour in Pandas: according to the dataframe and the sort, groupby returns different results:

import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
                   'val1': [11, None],
                   'val2': [None, 31]
                  })

df2 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
                   'val1': [11, None],
                   'val2': [31, None]
                  })



df1b = df1.melt("dtime").dropna().set_index("dtime")
df2b = df2.melt("dtime").dropna().set_index("dtime")

r1a = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r1b = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))

r2a = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r2b = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))

print "\n--- df1 ascending sort ---------------\n", r1a
print "\n--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------\n", r1b
print "\n--- df2 ascending sort ---------------\n", r2a
print "\n--- df2 descending sort --------------\n", r2b

return

--- df1 ascending sort ---------------
dtime
2017-01-01 01:05:00    11.0
2017-01-01 01:20:00    31.0
Name: value, dtype: float64

--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
dtime                dtime              
2017-01-01 01:00:00  2017-01-01 01:20:00    31.0
                     2017-01-01 01:05:00    11.0
Name: value, dtype: float64

--- df2 ascending sort ---------------
dtime
2017-01-01 01:05:00    11.0
2017-01-01 01:05:00    31.0
Name: value, dtype: float64

--- df2 descending sort --------------
dtime
2017-01-01 01:05:00    31.0
2017-01-01 01:05:00    11.0
Name: value, dtype: float64

Why df1 descending sort returns the reference time for grouping (2017-01-01 01:00:00), while it is missing in the other groups? I would expect that all grouping return a reference time as multi-index, without exception.

I created an issue at github pandas #18908 but in a month had no useful answer. I would like to understand if it is just a presentation problem or a possible bug. Am I missing something?

Upvotes: 1

Views: 150

Answers (2)

muskrat
muskrat

Reputation: 1559

This behavior is at first unexpected, but I do not think it is a bug. First, note that it is not ascending=False that causes this behavior (see examples below). Second, when the grouper is called such that the multi-index reference time would be redundant (or unnecessary), it is skipped. And this occurs when the values being sorted are in the same order as the timestamps. I see @Phil Sheard has proposed an approach to get your expected result (+1 for that), so below I am just expanding on the behavior and why I think it is sensible for pd to be doing that.

First, notice that this behavior happens equivalently on the ascending search when I make a df3 that has the val data in the opposite order:

import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
                   'val1': [11, None],
                   'val2': [None, 31]
                  })

df2 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
                   'val1': [11, None],
                   'val2': [31, None]
                  })

df3 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
                   'val1': [None, 11],
                   'val2': [31, None]
                  })


df1b = df1.melt("dtime").dropna().set_index("dtime")
df2b = df2.melt("dtime").dropna().set_index("dtime")
df3b = df3.melt("dtime").dropna().set_index("dtime")

r1a = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r1b = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))

r2a = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r2b = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))

r3a = df3b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r3b = df3b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))


print("\n--- df1 ascending sort ---------------\n {}".format(r1a))
print("\n--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------\n {}".format(r1b))
print("\n--- df2 ascending sort ---------------\n {}".format(r2a))
print("\n--- df2 descending sort --------------\n {}".format(r2b))
print("\n--- df3 ascending sort SHOULD IT ALWAYS BE LIKE THIS? --------------\n {}".format(r3a))
print("\n--- df3 descending sort --------------\n {}".format(r3b))

which returns:

--- df1 ascending sort ---------------
 dtime
2017-01-01 01:05:00    11.0
2017-01-01 01:20:00    31.0
Name: value, dtype: float64

--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
 dtime                dtime              
2017-01-01 01:00:00  2017-01-01 01:20:00    31.0
                     2017-01-01 01:05:00    11.0
Name: value, dtype: float64

--- df2 ascending sort ---------------
 dtime
2017-01-01 01:05:00    11.0
2017-01-01 01:05:00    31.0
Name: value, dtype: float64

--- df2 descending sort --------------
 dtime
2017-01-01 01:05:00    31.0
2017-01-01 01:05:00    11.0
Name: value, dtype: float64

--- df3 ascending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
 dtime                dtime              
2017-01-01 01:00:00  2017-01-01 01:20:00    11.0
                     2017-01-01 01:05:00    31.0
Name: value, dtype: float64

--- df3 descending sort --------------
 dtime
2017-01-01 01:05:00    31.0
2017-01-01 01:20:00    11.0
Name: value, dtype: float64

Notice that the multi-index reference is there when the sorted values induce the time stamps to be out of order, but absent (as unnecessary) otherwise. I have done this with more than two values (and other permutations) and the behavior is the same.

Final Comment As stated, I don't think it's a bug. That said, I agree with you that it's not necessarily the most natural design choice, as it leads to strange differences like the one you've found.

Upvotes: 2

Phil Sheard
Phil Sheard

Reputation: 2162

Tested on Python 3.5 and Pandas 0.22 - using slightly different code (pd.melt() instead of df.melt).

Firstly - the reason for the difference.

df1 has values for two different timestamps (01:20:00 and 01:05:00), where df2 both observations that remain are on the same timestamp (01:05:00).

I'm not exactly sure why, but evidently the act of reversing the sort means does mean that the .values call is including the grouper time window on df1 because you're sorting values as descending within a ascending timeindex. The result is different for df2 because the timestamps are identical and so the grouper isn't required. You possibly already deduced that, so apologies for duplication if so.

Suggestion to consistently show your expected result

Normally when working with pandas.DataFrame.groupby I would use an aggregation call to manipulate the data (max in the example below but sum, count and others are available). If you're looking to see your values with the reference time of the grouper, this would be the way to approach it:

Here's df1b, with multiple timestamps against the observation values:

In []: df1b.groupby([pd.Grouper(freq="1h"), df1b.index, "variable"])\
       .max().sort_values("value", ascending=False)
Out[]:
                                                  value
dtime               dtime               variable
2017-01-01 01:00:00 2017-01-01 01:20:00 val2       31.0
                    2017-01-01 01:05:00 val1       11.0    

...and same approach for df2b:

In []: df2b.groupby([pd.Grouper(freq="1h"), df2b.index, "variable"])\
       .max().sort_values("value", ascending=False)
Out[]:
                                                  value
dtime               dtime               variable
2017-01-01 01:00:00 2017-01-01 01:05:00 val2       31.0
                                        val1       11.0

Does this help? Or have I missed the significance of why you were working with the .values method?

Disclaimer: I'm not familiar with working with groupby.value so you might be trying to achieve something I've missed.

Upvotes: 2

Related Questions