Reputation: 419
I wanted to groupby by 'label' and 'month' to sum the Quantity sold for each month and for each label.
I am trying to do 'groupby and apply' method for achieving this, but not sure how to count the month for each label. Say, for label value AFFLELOU (DOS), I have two values for 7th month. so, I should sum the quantity sold and divide by 2. for 9th and 10th month, I just have one value, so the count would be 1 and it would divide the quantity sold.
I wrote the code below, but it doesn't take count as a function and return count not defined error.
t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold']
.sum()/count('month'))
Can someone tell me how to get the count value of each month for each label?
Thanks in Advance.
Upvotes: 6
Views: 9267
Reputation: 881027
Instead of summing, counting and dividing, you could use agg('mean')
:
t1.groupby(['label', 'month'])['Quantity sold'].agg('mean')
Or, if you do wish to retain the sum and count, use:
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean'])
For example,
import numpy as np
import pandas as pd
t1 = pd.DataFrame(np.random.randint(4, size=(20,3)), columns=['label', 'Quantity sold', 'month'])
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'mean'])
yields a DataFrame like
sum count mean
label month
0 1 2 1 2.00
2 0 1 0.00
3 2 2 1.00
1 1 1 2 0.50
2 3 1 3.00
3 1 1 1.00
2 0 0 1 0.00
1 0 3 0.00
3 5 4 1.25
3 0 1 1 1.00
1 0 1 0.00
2 0 1 0.00
3 3 1 3.00
Using groupby/agg
with its builtin aggregators sum
, count
and mean
is clearly more convenient here, but if you did need to use groupby/apply
with a custom function you could use:
t1.groupby(['label', 'month']).apply(lambda x: x['Quantity sold'].sum()/len(x))
Note that while calling custom functions with groupby/apply
gives you more flexibility, it comes at a cost because calling a custom Python function once for each group is generally slower than calling the builtin Cythonized aggregators available in groupby/agg
.
If you have missing (NaN
) values in Quantity sold
, it may help to know that group/agg
has both 'count'
and 'size'
aggregators:
'count'
returns the number of non-NaN
values'size'
returns the length of the group (including NaN
values)The count
is always less than or equal to the size
. The mean
is the sum
(of the non-NaN
values) divided by the count
. To see the difference between count
and size
, you could experiment with this code:
np.random.seed(2018)
t1 = pd.DataFrame(np.random.randint(4, size=(50,3)), columns=['label', 'Quantity sold', 'month'])
t1.loc[np.random.choice([True, False], len(t1)), 'Quantity sold'] = np.nan
t1.groupby(['label', 'month'])['Quantity sold'].agg(['sum', 'count', 'size', 'mean'])
which yields
sum count size mean
label month
0 1 0.0 0 3 NaN
2 6.0 2 2 3.000000
3 0.0 0 1 NaN
1 0 3.0 2 5 1.500000
1 0.0 0 1 NaN
2 5.0 3 5 1.666667
3 0.0 2 3 0.000000
2 0 7.0 3 5 2.333333
1 4.0 4 8 1.000000
2 5.0 2 3 2.500000
3 5.0 2 3 2.500000
3 0 1.0 2 5 0.500000
1 3.0 1 1 3.000000
2 2.0 1 2 2.000000
3 2.0 1 3 2.000000
Upvotes: 7