Reputation: 585
def afun(group):
aa=len(group)
group.sort_values(inplace=True)
return pandas.DataFrame({'score':numpy.arange(aa),'price':group})
df = pandas.DataFrame({
'stock':numpy.repeat( ['AAPL','GOOG','YHOO'], 3 ),
'date':numpy.tile( pandas.date_range('5/5/2015', periods=3, freq='D'), 3 ),
'price':(numpy.random.randn(9).cumsum() + 10) ,
'price2':(numpy.random.randn(9).cumsum() + 10)})
df = df.set_index(['stock','date'])
agroupDf=df.groupby(level='date')
tt=agroupDf['price'].apply(afun)
the value of variable tt is shown in the figure
my question is why tt has two columns 'date',and how to avoid the secend column 'date'?
Upvotes: 10
Views: 5504
Reputation: 1696
This is also happening to me when I'm using level
to groupby
with a MultiIndex
DataFrame
.
Fix: use the group_keys=False
flag. It's True
by default from V2.0.0
:
df.set_index(["lvl1", "lvl2"], inplace=True)
df = df.groupby(level="lvl1", group_keys=False).apply(func)
Upvotes: 2
Reputation: 510
Alternative approach is to set as_index = False
in the groupby, i.e:
agroupDf=df.groupby(level='date', as_index=False)
This returns:
score price
stock date
0 YHOO 2015-05-05 0 5.021206
GOOG 2015-05-05 1 7.544137
AAPL 2015-05-05 2 9.153860
1 YHOO 2015-05-06 0 4.649305
GOOG 2015-05-06 1 6.862662
AAPL 2015-05-06 2 7.869256
2 GOOG 2015-05-07 0 6.149092
YHOO 2015-05-07 1 6.269954
AAPL 2015-05-07 2 8.185481
check Pandas documentation for details.
Upvotes: 3
Reputation: 153
I ran into this same issue and identified a solution, though it's not clear if its applicable here, because the code cited in the question would return an error for me. Anyway, here is the setup for a toy example:
import numpy as np
from pandas import DataFrame,Series
def my_func(group):
"""Demonstrator function for duplicate index"""
_sum = group[0].sum()
_mean = group[0].mean()
_median = group[0].median()
return DataFrame({'sum':_sum,'mean':_mean,'median':_median},
index=[group.first_valid_index()])
df = DataFrame(np.random.rand(10,5))
df.head()
0 1 2 3 4
0 0.779801 0.704764 0.042028 0.831537 0.359437
1 0.515832 0.561174 0.530190 0.171031 0.927605
2 0.612827 0.730794 0.406032 0.192046 0.227649
3 0.698508 0.261531 0.919857 0.102664 0.506072
4 0.239987 0.871934 0.660110 0.402880 0.902426
df.groupby(level=0).apply(my_func)
sum mean median
0 0 0.779801 0.779801 0.779801
1 1 0.515832 0.515832 0.515832
2 2 0.612827 0.612827 0.612827
3 3 0.698508 0.698508 0.698508
4 4 0.239987 0.239987 0.239987
First, not including the index of your choice in the DataFrame constructor:
index=[group.first_valid_index()]
Should return an error if your groupby function is operating on a single column. For some reason with multiple columns it does not. In that case you get a multiindex without duplication returned.
Here, though, we are looking at a single column and included the index pointer and clearly get a duplicate index back. Turns out the solution is to return a Series instead of Dataframe:
def my_func_series(group):
"""Demonstrator function for duplicate index"""
_sum = group[0].sum()
_mean = group[0].mean()
_median = group[0].median()
return Series({'sum':_sum,'mean':_mean,'median':_median})
Now, the groupby.apply does not return a duplicate index:
df.groupby(level=0).apply(my_func_series)
sum mean median
0 0.779801 0.779801 0.779801
1 0.515832 0.515832 0.515832
2 0.612827 0.612827 0.612827
3 0.698508 0.698508 0.698508
4 0.239987 0.239987 0.239987
Since each row of a DataFrame is itself a series, and your goal is to return a single row per group, this makes sense.
Upvotes: 1
Reputation: 25639
The first one is the groupby
'date'. The second one is the index
'date'.
changing things around - this time groupby
stock:
df = df.set_index(['date','stock'])
agroupDf = df.groupby(level='stock')
tt=agroupDf['price'].apply(afun)
tt
price score
stock date stock
AAPL 2015-05-05 AAPL 9.333143 0
2015-05-06 AAPL 9.680022 1
2015-05-07 AAPL 9.870889 2
GOOG 2015-05-06 GOOG 10.030032 0
2015-05-05 GOOG 10.229084 1
2015-05-07 GOOG 10.571631 2
YHOO 2015-05-07 YHOO 9.996925 0
2015-05-05 YHOO 10.342180 1
2015-05-06 YHOO 10.586120 2
I think you want this:
df = df.set_index('stock')
agroupDf = df.groupby('date')
tt=agroupDf['price'].apply(afun)
tt
price score
date stock
2015-05-05 AAPL 10.414396 0
GOOG 12.608225 1
YHOO 12.830496 2
2015-05-06 AAPL 10.428767 0
GOOG 11.189663 1
YHOO 11.988177 2
2015-05-07 YHOO 11.202677 0
AAPL 11.274440 1
GOOG 11.780654 2
Upvotes: 5