vishnu prashanth
vishnu prashanth

Reputation: 419

how to divide the sum of a groupby value with the count the another value

I wanted to groupby by 'label' and 'month' to sum the Quantity sold for each month and for each label.

Dataset

I am trying to do 'groupby and apply' method for achieving this, but not sure how to count the month for each label. Say, for label value AFFLELOU (DOS), I have two values for 7th month. so, I should sum the quantity sold and divide by 2. for 9th and 10th month, I just have one value, so the count would be 1 and it would divide the quantity sold.

I wrote the code below, but it doesn't take count as a function and return count not defined error.

t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'] 
.sum()/count('month'))

Can someone tell me how to get the count value of each month for each label?

Thanks in Advance.

Upvotes: 6

Views: 9267

Answers (1)

unutbu
unutbu

Reputation: 881027

Instead of summing, counting and dividing, you could use agg('mean'):

t1.groupby(['label', 'month'])['Quantity sold'].agg('mean') 

Or, if you do wish to retain the sum and count, use:

t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean']) 

For example,

import numpy as np
import pandas as pd

t1 = pd.DataFrame(np.random.randint(4, size=(20,3)), columns=['label', 'Quantity sold', 'month'])
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean']) 

yields a DataFrame like

             sum  count  mean
label month                  
0     1        2      1  2.00
      2        0      1  0.00
      3        2      2  1.00
1     1        1      2  0.50
      2        3      1  3.00
      3        1      1  1.00
2     0        0      1  0.00
      1        0      3  0.00
      3        5      4  1.25
3     0        1      1  1.00
      1        0      1  0.00
      2        0      1  0.00
      3        3      1  3.00

Using groupby/agg with its builtin aggregators sum, count and mean is clearly more convenient here, but if you did need to use groupby/apply with a custom function you could use:

t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'].sum()/len(x))

Note that while calling custom functions with groupby/apply gives you more flexibility, it comes at a cost because calling a custom Python function once for each group is generally slower than calling the builtin Cythonized aggregators available in groupby/agg.


If you have missing (NaN) values in Quantity sold, it may help to know that group/agg has both 'count' and 'size' aggregators:

  • 'count' returns the number of non-NaN values
  • 'size' returns the length of the group (including NaN values)

The count is always less than or equal to the size. The mean is the sum (of the non-NaN values) divided by the count. To see the difference between count and size, you could experiment with this code:

np.random.seed(2018)
t1 = pd.DataFrame(np.random.randint(4, size=(50,3)), columns=['label', 'Quantity sold', 'month'])
t1.loc[np.random.choice([True, False], len(t1)), 'Quantity sold'] = np.nan
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'size', 'mean']) 

which yields

             sum  count  size      mean
label month                            
0     1      0.0      0     3       NaN
      2      6.0      2     2  3.000000
      3      0.0      0     1       NaN
1     0      3.0      2     5  1.500000
      1      0.0      0     1       NaN
      2      5.0      3     5  1.666667
      3      0.0      2     3  0.000000
2     0      7.0      3     5  2.333333
      1      4.0      4     8  1.000000
      2      5.0      2     3  2.500000
      3      5.0      2     3  2.500000
3     0      1.0      2     5  0.500000
      1      3.0      1     1  3.000000
      2      2.0      1     2  2.000000
      3      2.0      1     3  2.000000

Upvotes: 7

Related Questions