Reputation: 3823
Python newbie here. I'm trying to understand how the pandas groupby and apply methods work. I found this simple example, which I paste below:
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
The dataframe df looks like this:
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
2 Devils 2 2014 863
3 Devils 3 2015 673
4 Kings 3 2014 741
5 kings 4 2015 812
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
9 Royals 4 2014 701
10 Royals 1 2015 804
11 Riders 2 2017 690
So far, so good. I would then like to transform my data so that from every group of teams I'd only keep the very first element from the Points column. Having first checked that df['Points'][0] does indeed give me the first Points element of df, I tried this:
df.groupby('Team').apply(lambda x : x['Points'][0])
thinking that the argument x to the lambda function is another pandas dataframe. However, Python raises an error:
File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
which seems to have something to do with a HashTable, but I am unable to understand why. I then thought that maybe what is passed to the lambda is not a dataframe, so I ran this:
df.groupby('Team').apply(lambda x : (type(x), x.shape))
with output:
Team
Devils (<class 'pandas.core.frame.DataFrame'>, (2, 4))
Kings (<class 'pandas.core.frame.DataFrame'>, (3, 4))
Riders (<class 'pandas.core.frame.DataFrame'>, (4, 4))
Royals (<class 'pandas.core.frame.DataFrame'>, (2, 4))
kings (<class 'pandas.core.frame.DataFrame'>, (1, 4))
dtype: object
which, IIUC, shows that the argument to the lambda is indeed a pandas dataframe holding each team's subset of df.
I know I can get the desired result by running:
df.groupby('Team').apply(lambda x : x['Points'].iloc[0])
I just want to understand why df['Points'][0] works and x['Points'][0] doesn't from within the apply function. Thank you for reading!
Upvotes: 6
Views: 6447
Reputation: 2263
I added a simple function to visualize what happens during the process:
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
n = 1  # counts how many chunks have been seen so far
def f(chunk):
    global n
    print("This is the chunk n° {0}".format(n))
    print(chunk)
    n += 1
df.groupby('Team').apply(lambda x : f(x))
The output shows that f is called 5 times, once per group, and that n is incremented on each call.
Your sample has 5 distinct team labels ('Kings' and 'kings' count as different teams), so 5 groups are passed one by one to the function given to apply:
This is the chunk n° 1
Team Rank Year Points
2 Devils 2 2014 863
3 Devils 3 2015 673
This is the chunk n° 2
Team Rank Year Points
4 Kings 3 2014 741
6 Kings 1 2016 756
7 Kings 1 2017 788
This is the chunk n° 3
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
8 Riders 2 2016 694
11 Riders 2 2017 690
This is the chunk n° 4
Team Rank Year Points
9 Royals 4 2014 701
10 Royals 1 2015 804
This is the chunk n° 5
Team Rank Year Points
5 kings 4 2015 812
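As an alternative to the global counter, a groupby object can also be iterated directly, which yields (group name, sub-dataframe) pairs. A minimal sketch on a shortened copy of the data (the names index_by_team and chunk are only illustrative):

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings'],
    'Points': [876, 789, 863, 673, 741, 812],
})

# Iterating a groupby yields one (group name, sub-dataframe) pair per group
index_by_team = {}
for name, chunk in df.groupby('Team'):
    # Each chunk keeps the row labels it had in the original dataframe
    index_by_team[name] = chunk.index.tolist()

print(index_by_team)
```

Note that only the Riders chunk contains the row label 0, and that 'Kings' and 'kings' end up in separate groups.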
Upvotes: 0
Reputation: 21947
For the title question,
agroupby = df.groupby(...)
help( agroupby.apply ) # or in IPython xx.<tab> for help( xx )
apply(func, *args, **kwargs) method of pandas.core.groupby.generic.DataFrameGroupBy instance
Apply function func group-wise and combine the results together. The function passed to apply must take a dataframe as its first argument and return a DataFrame, Series or scalar. apply will then take care of combining the results back together into a single dataframe or series.
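To make the "scalar or Series" part of that docstring concrete, here is a small sketch on a shortened copy of the question's data. The column selection [['Points']] and the names totals and ranges are my own choices for illustration, not something the docstring prescribes:

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673],
})

# Selecting [['Points']] first means each chunk is a one-column dataframe
grouped = df.groupby('Team')[['Points']]

# The function returns a scalar per group: results are combined into a Series
totals = grouped.apply(lambda g: g['Points'].sum())

# The function returns a Series per group: results are combined into a dataframe
ranges = grouped.apply(
    lambda g: pd.Series({'min': g['Points'].min(), 'max': g['Points'].max()})
)

print(totals)
print(ranges)
```

In both cases the group keys (the team names) become the index of the combined result.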
Upvotes: 1
Reputation: 93141
When you call df.groupby('Team').apply(lambda x: ...), you are essentially chopping up the dataframe by Team and passing each chunk to the lambda function:
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
8 Riders 2 2016 694
11 Riders 2 2017 690
------------------------------
2 Devils 2 2014 863
3 Devils 3 2015 673
------------------------------
4 Kings 3 2014 741
6 Kings 1 2016 756
7 Kings 1 2017 788
------------------------------
5 kings 4 2015 812
------------------------------
9 Royals 4 2014 701
10 Royals 1 2015 804
df['Points'][0] works because you are telling pandas to "get the value at label 0 of the Points series", which exists.
.apply(lambda x: x['Points'][0]) doesn't work because only 1 chunk (Riders) has a label 0. For every other chunk the label lookup fails, hence the KeyError.
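The difference between label and position can be checked on a single chunk. This is a minimal sketch on a shortened copy of the data; the boolean filter below is just a stand-in for the chunk that groupby would pass for Devils, and the variable names are illustrative:

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673],
})

# Stand-in for the Devils chunk: the rows keep their original labels 2 and 3
devils = df[df['Team'] == 'Devils']
labels = devils.index.tolist()

# [0] on a Series with an integer index is a label lookup, and label 0 is absent
try:
    devils['Points'][0]
    raised_key_error = False
except KeyError:
    raised_key_error = True

first_by_position = devils['Points'].iloc[0]  # positional: first row of the chunk
first_by_label = devils['Points'].loc[2]      # label-based, using a label that exists

print(labels, raised_key_error, first_by_position, first_by_label)
```

This is why .iloc[0] works inside apply: it asks for the first row by position, whatever its label happens to be.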
Having said that, apply is generic, so it's pretty slow compared to the builtin vectorized aggregate functions. You can use first:
df.groupby('Team')['Points'].first()
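As a quick check that the two approaches agree on this kind of data, here is a sketch on a shortened copy of the question's dataframe (the names via_first and via_apply are illustrative). One caveat worth knowing: first() returns the first non-null value in each group, so the two can differ if a group starts with a missing value.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings'],
    'Points': [876, 789, 863, 673, 741, 812],
})

# Built-in aggregation: first non-null Points value per team
via_first = df.groupby('Team')['Points'].first()

# Generic apply with a positional lookup on each group's Series
via_apply = df.groupby('Team')['Points'].apply(lambda s: s.iloc[0])

print(via_first)
print(via_first.to_dict() == via_apply.to_dict())
```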
Upvotes: 7
Reputation: 11
groupby(...).apply passes each group's rows to the function as a sub-dataframe that keeps the original index labels, so [0] is looked up as a label and fails for any group that has no row labelled 0, hence the error. It works with df because df itself has a row labelled 0.
You could try something like this to get the first row for each team:
df.drop_duplicates(subset=['Team'])
Output:
Team Rank Year Points
0 Riders 1 2014 876
2 Devils 2 2014 863
4 Kings 3 2014 741
5 kings 4 2015 812
9 Royals 4 2014 701
In case you need to keep the row with the max/min points instead, you can sort df before dropping duplicates. Hope that helps.
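A sketch of the sort-then-drop idea, keeping the highest-scoring row per team. The data below is made up for illustration (I swapped the two Riders values so that the best row is not the first one), and the name best is arbitrary:

```python
import pandas as pd

# Made-up data for illustration: the best Riders row is deliberately not the first
df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Year': [2014, 2015, 2014, 2015],
    'Points': [789, 876, 863, 673],
})

# Sort so the highest Points come first, then keep the first row seen per team
best = df.sort_values('Points', ascending=False).drop_duplicates(subset=['Team'])

print(best)
```

Using ascending=True instead would keep the lowest-scoring row per team.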
Upvotes: 1