how to divide the sum of a groupby value with the count the another value

Question

I wanted to groupby by 'label' and 'month' to sum the Quantity sold for each month and for each label.

I am trying to do 'groupby and apply' method for achieving this, but not sure how to count the month for each label. Say, for label value AFFLELOU (DOS), I have two values for 7th month. so, I should sum the quantity sold and divide by 2. for 9th and 10th month, I just have one value, so the count would be 1 and it would divide the quantity sold.

I wrote the code below, but it doesn't take count as a function and return count not defined error.

t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'] 
.sum()/count('month'))

Can someone tell me how to get the count value of each month for each label?

Thanks in Advance.

unutbu · Accepted Answer

Instead of summing, counting and dividing, you could use agg('mean'):

t1.groupby(['label', 'month'])['Quantity sold'].agg('mean')

Or, if you do wish to retain the sum and count, use:

t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean'])

For example,

import numpy as np
import pandas as pd

t1 = pd.DataFrame(np.random.randint(4, size=(20,3)), columns=['label', 'Quantity sold', 'month'])
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean'])

yields a DataFrame like

             sum  count  mean
label month                  
0     1        2      1  2.00
      2        0      1  0.00
      3        2      2  1.00
1     1        1      2  0.50
      2        3      1  3.00
      3        1      1  1.00
2     0        0      1  0.00
      1        0      3  0.00
      3        5      4  1.25
3     0        1      1  1.00
      1        0      1  0.00
      2        0      1  0.00
      3        3      1  3.00

Using groupby/agg with its builtin aggregators sum, count and mean is clearly more convenient here, but if you did need to use groupby/apply with a custom function you could use:

t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'].sum()/len(x))

Note that while calling custom functions with groupby/apply gives you more flexibility, it comes at a cost because calling a custom Python function once for each group is generally slower than calling the builtin Cythonized aggregators available in groupby/agg.

If you have missing (NaN) values in Quantity sold, it may help to know that group/agg has both 'count' and 'size' aggregators:

'count' returns the number of non-NaN values
'size' returns the length of the group (including NaN values)

The count is always less than or equal to the size. The mean is the sum (of the non-NaN values) divided by the count. To see the difference between count and size, you could experiment with this code:

np.random.seed(2018)
t1 = pd.DataFrame(np.random.randint(4, size=(50,3)), columns=['label', 'Quantity sold', 'month'])
t1.loc[np.random.choice([True, False], len(t1)), 'Quantity sold'] = np.nan
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'size', 'mean'])

which yields

             sum  count  size      mean
label month                            
0     1      0.0      0     3       NaN
      2      6.0      2     2  3.000000
      3      0.0      0     1       NaN
1     0      3.0      2     5  1.500000
      1      0.0      0     1       NaN
      2      5.0      3     5  1.666667
      3      0.0      2     3  0.000000
2     0      7.0      3     5  2.333333
      1      4.0      4     8  1.000000
      2      5.0      2     3  2.500000
      3      5.0      2     3  2.500000
3     0      1.0      2     5  0.500000
      1      3.0      1     1  3.000000
      2      2.0      1     2  2.000000
      3      2.0      1     3  2.000000

how to divide the sum of a groupby value with the count the another value

Answers (1)

Related Questions