Summing over a multiindex pandas DataFrame

Question

Let's say I create the following dataframe with a df.set_index('Class','subclass'), bearing in mind there are multiple Classes with subclasses... A>Z.

Class   subclass    
  A       a           
  A       b 
  A       c 
  A       d 
  B       a            
  B       b

How would I count the subclasses in the Class and create a separate column named no of classes such that I can see the Class with the greatest number of subclasses? I was thinking some sort of for loop which runs through the Class letters and counts the subclass if that Class letter is still the same. However, this seems a bit counterintuitive for such a problem. Would there be a more simple approach such as a df.groupby[].count?

The desired output would be:

Class   subclass    No. of classes
  A       a                4    
  A       b 
  A       c 
  A       d 
  B       a                2    
  B       b

I have tried the level parameter as shown in group multi-index pandas dataframe but this doesn't seem to work for me

EDIT:

I did not mention that I wanted a return of the Class with the greatest number of subclasses. I achieved this with:

df.reset_index().groupby('Class')['subclass'].nunique().idxmax()

jezrael · Accepted Answer

You can use transform, but get duplicates values:

df['No. of classes'] = df.groupby(level='Class')['val'].transform('size')

print (df)
                val  No. of classes
Class subclass                     
A     a           1               4
      b           4               4
      c           5               4
      d           4               4
B     a           1               2
      b           2               2

But if need empty values:

df['No. of classes'] = df.groupby(level='Class')
                         .apply(lambda x: pd.Series( [len(x)] + [np.nan] * (len(x)-1)))
                         .values
print (df)
                val  No. of classes
Class subclass                     
A     a           1             4.0
      b           4             NaN
      c           5             NaN
      d           4             NaN
B     a           1             2.0
      b           2             NaN

Another solution for get Class with greatest number is:

df = df.groupby(level=['Class'])
       .apply(lambda x: x.index.get_level_values('subclass').nunique())
       .idxmax()
print (df)
A

Summing over a multiindex pandas DataFrame

Answers (2)

Related Questions