python pandas dataframe pandas-groupby pandas-apply

Reputation: 3823

python pandas groupby/apply: what exactly is passed to the apply function?

Python newbie here. I'm trying to understand how the pandas groupby and apply methods work. I found this simple example, which I paste below:

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

df = pd.DataFrame(ipl_data)

The dataframe df looks like this:

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690

So far, so good. I would then like to transform my data so that from every group of teams I'd only keep the very first element from the Points column. Having first checked that df['Points'][0] does indeed give me the first Points element of df, I tried this:

df.groupby('Team').apply(lambda x : x['Points'][0])

thinking that the argument x to the lambda function is another pandas dataframe. However, python yields an error:

File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

which seems to have something to do with a HashTable but I am unable to understand why. I then thought that maybe what is passed to the lambda is not a dataframe, so I ran this:

df.groupby('Team').apply(lambda x : (type(x), x.shape))

with output:

Team
Devils    (<class 'pandas.core.frame.DataFrame'>, (2, 4))
Kings     (<class 'pandas.core.frame.DataFrame'>, (3, 4))
Riders    (<class 'pandas.core.frame.DataFrame'>, (4, 4))
Royals    (<class 'pandas.core.frame.DataFrame'>, (2, 4))
kings     (<class 'pandas.core.frame.DataFrame'>, (1, 4))
dtype: object

which, IIUC, shows that the argument to the lambda is indeed a pandas dataframe holding each team's subset of df.

I know I can get the desired result by running:

df.groupby('Team').apply(lambda x : x['Points'].iloc[0])

I just want to understand why df['Points'][0] works and x['Points'][0] doesn't from within the apply function. Thank you for reading!

Upvotes: 6

Answers (4)

Laurent B.

Reputation: 2273

I added a simple function to visualize what happens during the process:

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

df = pd.DataFrame(ipl_data)

n=1
def f(chunk):
    global n
    print("This is the chunk n° {0}".format(n))
    print(chunk)
    n+=1
    
df.groupby('Team').apply(lambda x : f(x))

The output shows that f is called 5 times, once for each group that groupby creates.

Each time f receives a group, the counter n is incremented.

Your sample has only 5 distinct teams ('Kings' and 'kings' count as different values), so 5 groups are passed one by one to the applied function:

This is the chunk n° 1
     Team  Rank  Year  Points
2  Devils     2  2014     863
3  Devils     3  2015     673

This is the chunk n° 2
    Team  Rank  Year  Points
4  Kings     3  2014     741
6  Kings     1  2016     756
7  Kings     1  2017     788

This is the chunk n° 3
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
8   Riders     2  2016     694
11  Riders     2  2017     690

This is the chunk n° 4
      Team  Rank  Year  Points
9   Royals     4  2014     701
10  Royals     1  2015     804

This is the chunk n° 5
    Team  Rank  Year  Points
5  kings     4  2015     812

Upvotes: 0

denis

Reputation: 21947

For the title question,

agroupby = df.groupby(...)
help( agroupby.apply )  # or in IPython xx.<tab> for help( xx )

apply(func, *args, **kwargs) method of pandas.core.groupby.generic.DataFrameGroupBy instance

Apply function func group-wise and combine the results together.

The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series.
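To illustrate the "DataFrame, Series or scalar" part of that docstring, here is a small sketch with two lambdas of my own (not from the question). Selecting [['Points']] before apply is my choice to keep the grouping column out of the chunks; the exact shape of the combined result can vary between pandas versions for other return types.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

grouped = df.groupby('Team')[['Points']]

# One scalar per group: the results are combined into a Series indexed by Team.
scalar_result = grouped.apply(lambda g: g['Points'].max())

# One Series per group (same labels each time): combined into a DataFrame.
series_result = grouped.apply(
    lambda g: pd.Series({'min': g['Points'].min(), 'max': g['Points'].max()})
)

print(scalar_result)
print(series_result)
```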

Upvotes: 1

Code Different

Reputation: 93191

When you call df.groupby('Team').apply(lambda x: ...) you are essentially chopping up the dataframe by Team and passing each chunk to the lambda function:

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
8   Riders     2  2016     694
11  Riders     2  2017     690
------------------------------
2   Devils     2  2014     863
3   Devils     3  2015     673
------------------------------
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
------------------------------
5    kings     4  2015     812
------------------------------
9   Royals     4  2014     701
10  Royals     1  2015     804

df['Points'][0] works because you are telling pandas to "get the value at label 0 of the Points series", which exists.

.apply(lambda x: x['Points'][0]) doesn't work because only 1 chunk (Riders) has a label 0. Every other chunk raises a KeyError, because [0] on a Series with an integer index is a label lookup, not a positional one.
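The difference can be reproduced outside of apply. A minimal sketch, using a trimmed-down four-row frame of my own: the Devils chunk carries the labels 2 and 3, so a lookup by label 0 fails while a lookup by position 0 succeeds.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673],
})

# The chunk apply would receive for 'Devils'; it keeps labels 2 and 3.
devils = df[df['Team'] == 'Devils']
print(list(devils.index))

# Label-based lookup: there is no row labelled 0 in this chunk.
try:
    devils['Points'][0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based lookup: first row of the chunk, whatever its label.
first_by_position = devils['Points'].iloc[0]
print(label_lookup_failed, first_by_position)
```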

Having said that, apply is generic so it's pretty slow compared to the builtin vectorized aggregate functions. You can use first:

df.groupby('Team')['Points'].first()
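As a quick check, here is a sketch comparing first with the .iloc[0] version from the question. One caveat: first returns the first non-null entry of each group, so the two can differ if a group starts with a missing value (there are none in this data).

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Built-in aggregation.
via_first = df.groupby('Team')['Points'].first()

# Generic apply, taking the first element of each group by position.
via_apply = df.groupby('Team')['Points'].apply(lambda s: s.iloc[0])

print(via_first)
```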

Upvotes: 7

Sarath Reddy K

Reputation: 11

groupby().apply passes each group to the function as a DataFrame that keeps its original index labels, so [0] is looked up as a label and only the group containing label 0 can resolve it, hence the error. It works on df because df's own index does contain the label 0.

You may try something like this to get the first row for each team.

df.drop_duplicates(subset=['Team'])

Output:

     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
5   kings     4  2015     812
9  Royals     4  2014     701

In case you need to keep the row with the max/min points instead, you can sort the df before dropping duplicates. Hope that helps.
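A sketch of that sort-then-drop idea for the maximum, assuming Points is the column to rank by (the name best_rows is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Sort so the highest Points row of each team comes first, then keep only
# the first occurrence of every team.
best_rows = (
    df.sort_values('Points', ascending=False)
      .drop_duplicates(subset=['Team'])
)
print(best_rows)
```

Use ascending=True instead to keep the row with the minimum points.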

Upvotes: 1
