Reputation: 585

why groupby.apply return duplicate level

def afun(group):
  aa=len(group)
  group.sort_values(inplace=True)

  return pandas.DataFrame({'score':numpy.arange(aa),'price':group})



df = pandas.DataFrame({
              'stock':numpy.repeat( ['AAPL','GOOG','YHOO'], 3 ),
              'date':numpy.tile( pandas.date_range('5/5/2015', periods=3, freq='D'), 3 ),
              'price':(numpy.random.randn(9).cumsum() + 10) ,
              'price2':(numpy.random.randn(9).cumsum() + 10)})
df = df.set_index(['stock','date'])
agroupDf=df.groupby(level='date') 



tt=agroupDf['price'].apply(afun)

the value of variable tt is shown in the figure

my question is why tt has two columns 'date',and how to avoid the secend column 'date'?

Upvotes: 10

Answers (4)

Wesley Cheek

Reputation: 1696

This is also happening to me when I'm using level to groupby with a MultiIndex DataFrame.

Fix: use the group_keys=False flag. It's True by default from V2.0.0:

df.set_index(["lvl1", "lvl2"], inplace=True)
df = df.groupby(level="lvl1", group_keys=False).apply(func)

Upvotes: 2

dusio

Reputation: 510

Alternative approach is to set as_index = False in the groupby, i.e:

agroupDf=df.groupby(level='date', as_index=False)

This returns:

                    score     price
  stock date                       
0 YHOO  2015-05-05      0  5.021206
  GOOG  2015-05-05      1  7.544137
  AAPL  2015-05-05      2  9.153860
1 YHOO  2015-05-06      0  4.649305
  GOOG  2015-05-06      1  6.862662
  AAPL  2015-05-06      2  7.869256
2 GOOG  2015-05-07      0  6.149092
  YHOO  2015-05-07      1  6.269954
  AAPL  2015-05-07      2  8.185481

check Pandas documentation for details.

Upvotes: 3

Richard Rymer

Reputation: 153

I ran into this same issue and identified a solution, though it's not clear if its applicable here, because the code cited in the question would return an error for me. Anyway, here is the setup for a toy example:

import numpy as np
from pandas import DataFrame,Series

def my_func(group):
"""Demonstrator function for duplicate index"""
_sum = group[0].sum()
_mean = group[0].mean()
_median = group[0].median()
return DataFrame({'sum':_sum,'mean':_mean,'median':_median},
                 index=[group.first_valid_index()])

df = DataFrame(np.random.rand(10,5))
df.head()
    0   1   2   3   4
0   0.779801    0.704764    0.042028    0.831537    0.359437
1   0.515832    0.561174    0.530190    0.171031    0.927605
2   0.612827    0.730794    0.406032    0.192046    0.227649
3   0.698508    0.261531    0.919857    0.102664    0.506072
4   0.239987    0.871934    0.660110    0.402880    0.902426
df.groupby(level=0).apply(my_func)
         sum    mean    median
0   0   0.779801    0.779801    0.779801
1   1   0.515832    0.515832    0.515832
2   2   0.612827    0.612827    0.612827
3   3   0.698508    0.698508    0.698508
4   4   0.239987    0.239987    0.239987

First, not including the index of your choice in the DataFrame constructor:

index=[group.first_valid_index()]

Should return an error if your groupby function is operating on a single column. For some reason with multiple columns it does not. In that case you get a multiindex without duplication returned.

Here, though, we are looking at a single column and included the index pointer and clearly get a duplicate index back. Turns out the solution is to return a Series instead of Dataframe:

def my_func_series(group):
"""Demonstrator function for duplicate index"""
_sum = group[0].sum()
_mean = group[0].mean()
_median = group[0].median()
return Series({'sum':_sum,'mean':_mean,'median':_median})

Now, the groupby.apply does not return a duplicate index:

df.groupby(level=0).apply(my_func_series)
    sum mean    median
0   0.779801    0.779801    0.779801
1   0.515832    0.515832    0.515832
2   0.612827    0.612827    0.612827
3   0.698508    0.698508    0.698508
4   0.239987    0.239987    0.239987

Since each row of a DataFrame is itself a series, and your goal is to return a single row per group, this makes sense.

Upvotes: 1

Merlin

Reputation: 25639

The first one is the groupby 'date'. The second one is the index 'date'.

changing things around - this time groupby stock:

df       = df.set_index(['date','stock'])
agroupDf = df.groupby(level='stock') 

tt=agroupDf['price'].apply(afun)
tt


                            price  score
stock date       stock                  
AAPL  2015-05-05 AAPL    9.333143      0
      2015-05-06 AAPL    9.680022      1
      2015-05-07 AAPL    9.870889      2
GOOG  2015-05-06 GOOG   10.030032      0
      2015-05-05 GOOG   10.229084      1
      2015-05-07 GOOG   10.571631      2
YHOO  2015-05-07 YHOO    9.996925      0
      2015-05-05 YHOO   10.342180      1
      2015-05-06 YHOO   10.586120      2

I think you want this:

df       = df.set_index('stock')
agroupDf = df.groupby('date') 

tt=agroupDf['price'].apply(afun)
tt

                          price  score
date       stock                  
2015-05-05 AAPL   10.414396      0
           GOOG   12.608225      1
           YHOO   12.830496      2
2015-05-06 AAPL   10.428767      0
           GOOG   11.189663      1
           YHOO   11.988177      2
2015-05-07 YHOO   11.202677      0
           AAPL   11.274440      1
           GOOG   11.780654      2

Upvotes: 5

why groupby.apply return duplicate level

Answers (4)

Related Questions