inaMinute
inaMinute

Reputation: 585

why groupby.apply return duplicate level

def afun(group):
  aa=len(group)
  group.sort_values(inplace=True)

  return pandas.DataFrame({'score':numpy.arange(aa),'price':group})



df = pandas.DataFrame({
              'stock':numpy.repeat( ['AAPL','GOOG','YHOO'], 3 ),
              'date':numpy.tile( pandas.date_range('5/5/2015', periods=3, freq='D'), 3 ),
              'price':(numpy.random.randn(9).cumsum() + 10) ,
              'price2':(numpy.random.randn(9).cumsum() + 10)})
df = df.set_index(['stock','date'])
agroupDf=df.groupby(level='date') 



tt=agroupDf['price'].apply(afun)

the value of variable tt is shown in the figure

my question is why tt has two columns 'date',and how to avoid the secend column 'date'?

enter image description here

Upvotes: 10

Views: 5504

Answers (4)

Wesley Cheek
Wesley Cheek

Reputation: 1696

This is also happening to me when I'm using level to groupby with a MultiIndex DataFrame.

Fix: use the group_keys=False flag. It's True by default from V2.0.0:

df.set_index(["lvl1", "lvl2"], inplace=True)
df = df.groupby(level="lvl1", group_keys=False).apply(func)

Upvotes: 2

dusio
dusio

Reputation: 510

Alternative approach is to set as_index = False in the groupby, i.e:

agroupDf=df.groupby(level='date', as_index=False)

This returns:

                    score     price
  stock date                       
0 YHOO  2015-05-05      0  5.021206
  GOOG  2015-05-05      1  7.544137
  AAPL  2015-05-05      2  9.153860
1 YHOO  2015-05-06      0  4.649305
  GOOG  2015-05-06      1  6.862662
  AAPL  2015-05-06      2  7.869256
2 GOOG  2015-05-07      0  6.149092
  YHOO  2015-05-07      1  6.269954
  AAPL  2015-05-07      2  8.185481

check Pandas documentation for details.

Upvotes: 3

Richard Rymer
Richard Rymer

Reputation: 153

I ran into this same issue and identified a solution, though it's not clear if its applicable here, because the code cited in the question would return an error for me. Anyway, here is the setup for a toy example:

import numpy as np
from pandas import DataFrame,Series

def my_func(group):
"""Demonstrator function for duplicate index"""
_sum = group[0].sum()
_mean = group[0].mean()
_median = group[0].median()
return DataFrame({'sum':_sum,'mean':_mean,'median':_median},
                 index=[group.first_valid_index()])

df = DataFrame(np.random.rand(10,5))
df.head()
    0   1   2   3   4
0   0.779801    0.704764    0.042028    0.831537    0.359437
1   0.515832    0.561174    0.530190    0.171031    0.927605
2   0.612827    0.730794    0.406032    0.192046    0.227649
3   0.698508    0.261531    0.919857    0.102664    0.506072
4   0.239987    0.871934    0.660110    0.402880    0.902426
df.groupby(level=0).apply(my_func)
         sum    mean    median
0   0   0.779801    0.779801    0.779801
1   1   0.515832    0.515832    0.515832
2   2   0.612827    0.612827    0.612827
3   3   0.698508    0.698508    0.698508
4   4   0.239987    0.239987    0.239987

First, not including the index of your choice in the DataFrame constructor:

index=[group.first_valid_index()]

Should return an error if your groupby function is operating on a single column. For some reason with multiple columns it does not. In that case you get a multiindex without duplication returned.

Here, though, we are looking at a single column and included the index pointer and clearly get a duplicate index back. Turns out the solution is to return a Series instead of Dataframe:

def my_func_series(group):
"""Demonstrator function for duplicate index"""
_sum = group[0].sum()
_mean = group[0].mean()
_median = group[0].median()
return Series({'sum':_sum,'mean':_mean,'median':_median})

Now, the groupby.apply does not return a duplicate index:

df.groupby(level=0).apply(my_func_series)
    sum mean    median
0   0.779801    0.779801    0.779801
1   0.515832    0.515832    0.515832
2   0.612827    0.612827    0.612827
3   0.698508    0.698508    0.698508
4   0.239987    0.239987    0.239987

Since each row of a DataFrame is itself a series, and your goal is to return a single row per group, this makes sense.

Upvotes: 1

Merlin
Merlin

Reputation: 25639

The first one is the groupby 'date'. The second one is the index 'date'.

changing things around - this time groupby stock:

df       = df.set_index(['date','stock'])
agroupDf = df.groupby(level='stock') 

tt=agroupDf['price'].apply(afun)
tt


                            price  score
stock date       stock                  
AAPL  2015-05-05 AAPL    9.333143      0
      2015-05-06 AAPL    9.680022      1
      2015-05-07 AAPL    9.870889      2
GOOG  2015-05-06 GOOG   10.030032      0
      2015-05-05 GOOG   10.229084      1
      2015-05-07 GOOG   10.571631      2
YHOO  2015-05-07 YHOO    9.996925      0
      2015-05-05 YHOO   10.342180      1
      2015-05-06 YHOO   10.586120      2

I think you want this:

df       = df.set_index('stock')
agroupDf = df.groupby('date') 

tt=agroupDf['price'].apply(afun)
tt

                          price  score
date       stock                  
2015-05-05 AAPL   10.414396      0
           GOOG   12.608225      1
           YHOO   12.830496      2
2015-05-06 AAPL   10.428767      0
           GOOG   11.189663      1
           YHOO   11.988177      2
2015-05-07 YHOO   11.202677      0
           AAPL   11.274440      1
           GOOG   11.780654      2

Upvotes: 5

Related Questions