Reputation: 2566
Versions:
* Pandas 0.21.0
* Python 2.7.12
I see this strange behaviour in Pandas: according to the dataframe and the sort, groupby returns different results:
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
'val1': [11, None],
'val2': [None, 31]
})
df2 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
'val1': [11, None],
'val2': [31, None]
})
df1b = df1.melt("dtime").dropna().set_index("dtime")
df2b = df2.melt("dtime").dropna().set_index("dtime")
r1a = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r1b = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))
r2a = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r2b = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))
print "\n--- df1 ascending sort ---------------\n", r1a
print "\n--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------\n", r1b
print "\n--- df2 ascending sort ---------------\n", r2a
print "\n--- df2 descending sort --------------\n", r2b
return
--- df1 ascending sort ---------------
dtime
2017-01-01 01:05:00 11.0
2017-01-01 01:20:00 31.0
Name: value, dtype: float64
--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
dtime dtime
2017-01-01 01:00:00 2017-01-01 01:20:00 31.0
2017-01-01 01:05:00 11.0
Name: value, dtype: float64
--- df2 ascending sort ---------------
dtime
2017-01-01 01:05:00 11.0
2017-01-01 01:05:00 31.0
Name: value, dtype: float64
--- df2 descending sort --------------
dtime
2017-01-01 01:05:00 31.0
2017-01-01 01:05:00 11.0
Name: value, dtype: float64
Why df1 descending sort
returns the reference time for grouping (2017-01-01 01:00:00
), while it is missing in the other groups?
I would expect that all grouping return a reference time as multi-index, without exception.
I created an issue at github pandas #18908 but in a month had no useful answer. I would like to understand if it is just a presentation problem or a possible bug. Am I missing something?
Upvotes: 1
Views: 150
Reputation: 1559
This behavior is at first unexpected, but I do not think it is a bug.
First, note that it is not ascending=False
that causes this behavior (see examples below). Second, when the grouper is called such that the multi-index reference time would be redundant (or unnecessary), it is skipped. And this occurs when the values being sorted are in the same order as the timestamps. I see @Phil Sheard has proposed an approach to get your expected result (+1 for that), so below I am just expanding on the behavior and why I think it is sensible for pd
to be doing that.
First, notice that this behavior happens equivalently on the ascending search when I make a df3
that has the val
data in the opposite order:
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
'val1': [11, None],
'val2': [None, 31]
})
df2 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
'val1': [11, None],
'val2': [31, None]
})
df3 = pd.DataFrame({'dtime': [datetime(2017,1,1,1,5), datetime(2017,1,1,1,20)],
'val1': [None, 11],
'val2': [31, None]
})
df1b = df1.melt("dtime").dropna().set_index("dtime")
df2b = df2.melt("dtime").dropna().set_index("dtime")
df3b = df3.melt("dtime").dropna().set_index("dtime")
r1a = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r1b = df1b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))
r2a = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r2b = df2b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))
r3a = df3b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=True))
r3b = df3b.groupby(pd.Grouper(freq="1h")).value.apply(lambda x: x.sort_values(ascending=False))
print("\n--- df1 ascending sort ---------------\n {}".format(r1a))
print("\n--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------\n {}".format(r1b))
print("\n--- df2 ascending sort ---------------\n {}".format(r2a))
print("\n--- df2 descending sort --------------\n {}".format(r2b))
print("\n--- df3 ascending sort SHOULD IT ALWAYS BE LIKE THIS? --------------\n {}".format(r3a))
print("\n--- df3 descending sort --------------\n {}".format(r3b))
which returns:
--- df1 ascending sort ---------------
dtime
2017-01-01 01:05:00 11.0
2017-01-01 01:20:00 31.0
Name: value, dtype: float64
--- df1 descending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
dtime dtime
2017-01-01 01:00:00 2017-01-01 01:20:00 31.0
2017-01-01 01:05:00 11.0
Name: value, dtype: float64
--- df2 ascending sort ---------------
dtime
2017-01-01 01:05:00 11.0
2017-01-01 01:05:00 31.0
Name: value, dtype: float64
--- df2 descending sort --------------
dtime
2017-01-01 01:05:00 31.0
2017-01-01 01:05:00 11.0
Name: value, dtype: float64
--- df3 ascending sort SHOULD IT ALWAYS BE LIKE THIS? --------------
dtime dtime
2017-01-01 01:00:00 2017-01-01 01:20:00 11.0
2017-01-01 01:05:00 31.0
Name: value, dtype: float64
--- df3 descending sort --------------
dtime
2017-01-01 01:05:00 31.0
2017-01-01 01:20:00 11.0
Name: value, dtype: float64
Notice that the multi-index reference is there when the sorted values induce the time stamps to be out of order, but absent (as unnecessary) otherwise. I have done this with more than two values (and other permutations) and the behavior is the same.
Final Comment As stated, I don't think it's a bug. That said, I agree with you that it's not necessarily the most natural design choice, as it leads to strange differences like the one you've found.
Upvotes: 2
Reputation: 2162
Tested on Python 3.5 and Pandas 0.22 - using slightly different code (pd.melt()
instead of df.melt
).
Firstly - the reason for the difference.
df1
has values for two different timestamps (01:20:00 and 01:05:00), where df2 both observations that remain are on the same timestamp (01:05:00).
I'm not exactly sure why, but evidently the act of reversing the sort means does mean that the .values
call is including the grouper time window on df1
because you're sorting values as descending within a ascending timeindex. The result is different for df2
because the timestamps are identical and so the grouper isn't required. You possibly already deduced that, so apologies for duplication if so.
Suggestion to consistently show your expected result
Normally when working with pandas.DataFrame.groupby
I would use an aggregation call to manipulate the data (max
in the example below but sum
, count
and others are available). If you're looking to see your values with the reference time of the grouper, this would be the way to approach it:
Here's df1b, with multiple timestamps against the observation values:
In []: df1b.groupby([pd.Grouper(freq="1h"), df1b.index, "variable"])\
.max().sort_values("value", ascending=False)
Out[]:
value
dtime dtime variable
2017-01-01 01:00:00 2017-01-01 01:20:00 val2 31.0
2017-01-01 01:05:00 val1 11.0
...and same approach for df2b:
In []: df2b.groupby([pd.Grouper(freq="1h"), df2b.index, "variable"])\
.max().sort_values("value", ascending=False)
Out[]:
value
dtime dtime variable
2017-01-01 01:00:00 2017-01-01 01:05:00 val2 31.0
val1 11.0
Does this help? Or have I missed the significance of why you were working with the .values
method?
Disclaimer: I'm not familiar with working with groupby.value
so you might be trying to achieve something I've missed.
Upvotes: 2