g_uint

Reputation: 2031

How are the arguments of a function interpreted in groupby.apply in pandas?

In an Introduction to Data Science in Python course on Coursera, the following example is given:

df.groupby('Category').apply(lambda df,a,b: sum(df[a] * df[b]), 'Weight (oz.)', 'Quantity')

where df is a DataFrame, and the lambda is applied to calculate, for each group, the sum of the products of two columns.

If I understand correctly, the groupby object (returned by groupby) that the apply function is called on is a series of tuples consisting of the index that was grouped by and the part of the DataFrame that is specific to that grouping.

What I don't understand is the way that the lambda is used.

There are three arguments specified (lambda df,a,b), but only two are explicitly passed ('Weight (oz.)' and 'Quantity'). How does the interpreter know that arguments a and b are the ones specified as arguments and df is used 'as-is'?

I'm thinking this has to do with df being in scope but cannot find information to support and detail that thought.

Upvotes: 21

Views: 71166

Answers (2)

cottontail

Reputation: 23171

An easy way to see what's passed to groupby.apply is to print it.

import pandas as pd

# sample data
df = pd.DataFrame({
    'category': ['a','a','a','b','b','b','b','c','c','c'],
    'num1': [9, 3, 1, 2, 5, 2, 8, 0, 4, 10],
    'num2': [5, 8, 8, 9, 8, 10, 8, 8, 2, 8],
    'num3': [0, 1, 4, 4, 2, 5, 5, 8, 5, 1]})


# pass print to apply
df.groupby('category').apply(print)

# printing only the first two rows of each group may be enough
df.groupby('category').apply(lambda g: print(g.head(2)))

The second call prints:

  category  num1  num2  num3
0        a     9     5     0
1        a     3     8     1
  category  num1  num2  num3
3        b     2     9     4
4        b     5     8     2
  category  num1  num2  num3
7        c     0     8     8
8        c     4     2     5

As you can see, the dataframe is split into smaller dataframes where the category values are the same in each group (because it was used as the grouper). This is the first argument passed to lambda.
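Another way to look at the pieces, without going through apply at all, is to iterate over the groupby object. The sketch below uses a trimmed-down frame (three rows, made up for illustration) and collects the (key, sub-DataFrame) pairs into a dict named pieces, which is just a name chosen here. Each sub-DataFrame is essentially what apply hands to the function as its first argument, although depending on the pandas version the grouping column itself may be left out of it.

```python
import pandas as pd

# Trimmed-down illustrative data, not the full sample above
df = pd.DataFrame({
    'category': ['a', 'a', 'b'],
    'num1': [9, 3, 2],
    'num2': [5, 8, 9],
})

# Iterating over a GroupBy yields (group key, sub-DataFrame) pairs
pieces = {key: group for key, group in df.groupby('category')}

for key, group in pieces.items():
    # Show the key, the type of the piece, and which original rows it holds
    print(key, type(group).__name__, group.index.tolist())
```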

If the lambda passed to apply requires more arguments, they can be supplied either by position (arg) or by keyword (kwarg).

# args
df.groupby('category').apply(lambda g,a,b: sum(g[a] * g[b]), 'num1', 'num2')

# kwargs
df.groupby('category').apply(lambda g,a,b: sum(g[a] * g[b]), a='num1', b='num2')
#                                     ^ ^                    ^^        ^^
# category
# a     77
# b    142
# c     88
# dtype: int64
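The same thing works with a named function, which can make the argument binding easier to follow. This is only a sketch: weighted_sum and its default value for b are made up for illustration. The two numeric columns are selected before apply so that the result does not depend on how a given pandas version treats the grouping column.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
    'num1': [9, 3, 1, 2, 5, 2, 8, 0, 4, 10],
    'num2': [5, 8, 8, 9, 8, 10, 8, 8, 2, 8],
})

def weighted_sum(group, a, b='num2'):
    # group is supplied by apply; a and b come from the caller (or the default)
    return (group[a] * group[b]).sum()

grouped = df.groupby('category')[['num1', 'num2']]

by_position = grouped.apply(weighted_sum, 'num1', 'num2')    # extras by position
by_keyword = grouped.apply(weighted_sum, a='num1', b='num2')  # extras by keyword
by_default = grouped.apply(weighted_sum, 'num1')              # b falls back to its default

print(by_position.tolist())
```

All three calls give the same per-group values, because in every case the group fills the first parameter and the remaining parameters are bound from what was passed to apply.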

Upvotes: 1

RSHAP

Reputation: 2446

The apply method itself passes each "group" of the groupby object as the first argument to the function. The remaining arguments are matched by position, so 'Weight (oz.)' and 'Quantity' are bound to a and b (i.e. they are the 2nd and 3rd arguments if you count the implicit "group" argument as the 1st).

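import numpy as np
import pandas as pd

# random integers from 0 to 10, so your values will differ from the ones shown below
df = pd.DataFrame(np.random.randint(0,11,(10,3)), columns = ['num1','num2','num3'])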
df['category'] = ['a','a','a','b','b','b','b','c','c','c']
df = df[['category','num1','num2','num3']]
df

  category  num1  num2  num3
0        a     2     5     2
1        a     5     5     2
2        a     7     3     4
3        b    10     9     1
4        b     4     7     6
5        b     0     5     2
6        b     7     7     5
7        c     2     2     1
8        c     4     3     2
9        c     1     4     6

gb = df.groupby('category')

The implicit argument is each "group", or in this case the rows belonging to each category:

gb.apply(lambda grp: grp.sum()) 

Here "grp" is the first argument of the lambda function. Notice that I don't have to pass anything for it: apply automatically fills it with each group of the groupby object in turn.

         category  num1  num2  num3
category                           
a             aaa    14    13     8
b            bbbb    21    28    14
c             ccc     7     9     9

So apply goes through each of these groups and performs a sum operation on it:

print(gb.groups)
{'a': Int64Index([0, 1, 2], dtype='int64'), 'b': Int64Index([3, 4, 5, 6], dtype='int64'), 'c': Int64Index([7, 8, 9], dtype='int64')}

print('1st GROUP:\n', df.loc[gb.groups['a']])
1st GROUP:
  category  num1  num2  num3
0        a     2     5     2
1        a     5     5     2
2        a     7     3     4    


print('SUM of 1st group:\n', df.loc[gb.groups['a']].sum())

SUM of 1st group:
category    aaa
num1         14
num2         13
num3          8
dtype: object

Notice how this is the same as the first row of our previous operation.

So apply implicitly passes each group to the function as its first argument.

From the docs

GroupBy.apply(func, *args, **kwargs)

args, kwargs : tuple and dict

Optional positional and keyword arguments to pass to func

Additional arguments given in "*args" are passed after the implicit group argument.
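To make the forwarding concrete, here is a rough pure-Python sketch of the idea. It is not how pandas actually implements apply; apply_sketch is a hypothetical stand-in that only mimics the argument passing, using the same values as the sample table above.

```python
import pandas as pd

def apply_sketch(grouped, func, *args, **kwargs):
    """Hypothetical stand-in for GroupBy.apply: call func once per group."""
    results = {}
    for key, group in grouped:
        # The group always goes first; whatever the caller supplied follows it
        results[key] = func(group, *args, **kwargs)
    return pd.Series(results)

df = pd.DataFrame({
    'category': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
    'num1': [2, 5, 7, 10, 4, 0, 7, 2, 4, 1],
    'num2': [5, 5, 3, 9, 7, 5, 7, 2, 3, 4],
})

weighted = apply_sketch(df.groupby('category'),
                        lambda g, a, b: sum(g[a] * g[b]),
                        'num1', 'num2')

# Convert to plain ints so the printout does not depend on the NumPy version
print({key: int(value) for key, value in weighted.items()})
# {'a': 56, 'b': 167, 'c': 20}
```

The lambda never needs to know where its first argument came from; the caller (here apply_sketch, in practice apply) decides what goes into that slot.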

So, using your code:

gb.apply(lambda df,a,b: sum(df[a] * df[b]), 'num1', 'num2')

category
a     56
b    167
c     20
dtype: int64

Here 'num1' and 'num2' are passed as additional arguments to each call of the lambda function.

So apply goes through each of the groups and performs your lambda operation on it:

# copy and paste your lambda function
fun = lambda df,a,b: sum(df[a] * df[b])

print(gb.groups)
{'a': Int64Index([0, 1, 2], dtype='int64'), 'b': Int64Index([3, 4, 5, 6], dtype='int64'), 'c': Int64Index([7, 8, 9], dtype='int64')}

print('1st GROUP:\n', df.loc[gb.groups['a']])

1st GROUP:
  category  num1  num2  num3
0        a     2     5     2
1        a     5     5     2
2        a     7     3     4

print('Output of 1st group for function "fun":\n', 
fun(df.loc[gb.groups['a']], 'num1','num2'))

Output of 1st group for function "fun":
56
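On the guess about df being "in scope": inside the lambda, df is just a parameter name. It shadows any outer variable called df and refers to whatever is passed in first. A small sketch with made-up data (three rows, chosen only for illustration) shows that the result follows the argument, not the outer variable:

```python
import pandas as pd

# Outer variable that happens to share its name with the lambda's parameter
df = pd.DataFrame({
    'category': ['a', 'a', 'b'],
    'num1': [2, 5, 10],
    'num2': [5, 5, 9],
})

fun = lambda df, a, b: sum(df[a] * df[b])

# Only the rows passed in are used, even though the outer df holds more
only_a = int(fun(df[df['category'] == 'a'], 'num1', 'num2'))
whole = int(fun(df, 'num1', 'num2'))

print(only_a)  # 35
print(whole)   # 125
```

Renaming the parameter (for example to grp) would change nothing, which is why apply can fill that slot with each group.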

Upvotes: 27
