Reputation: 377
I have a data frame:
df = pd.DataFrame({"id": [1,1,1,2,2,2,3,3,3], "elev": [10,11,2,45,5,4,4,4,9], "year": [2009,2010,2010,2009,2010,2011,2006,2007,2007]})
id elev year
1 10 2009
1 11 2010
1 2 2010
2 45 2009
2 5 2010
2 4 2011
3 4 2006
3 4 2007
3 9 2007
I want to keep only the maximum year rows for each ID:
id elev year
1 11 2010
1 2 2010
2 4 2011
3 4 2007
3 9 2007
I tried the following code:
df = df.groupby("id")
df = df.max('year')
df = df.reset_index()
But its keeps only single row for each ID.
Upvotes: 1
Views: 1155
Reputation: 28709
You need to create a boolean to filter the original dataframe with; this is possible with transform:
In [164]: df.loc[df.year.eq(df.groupby('id').year.transform('max'))]
Out[164]:
id elev year
1 1 11 2010
2 1 2 2010
5 2 4 2011
7 3 4 2007
8 3 9 2007
Breakdown: This generates the max for each ID, and appends to each relevant index.
In [165]: df.groupby('id').year.transform('max')
Out[165]:
0 2010
1 2010
2 2010
3 2011
4 2011
5 2011
6 2007
7 2007
8 2007
Name: year, dtype: int64
Step2: Compare the year with the max
df.year.eq(df.groupby('id').year.transform('max'))
0 False
1 True
2 True
3 False
4 False
5 True
6 False
7 True
8 True
Name: year, dtype: bool
Step3 : Indexing
In [164]: df.loc[df.year.eq(df.groupby('id').year.transform('max'))]
Out[164]:
id elev year
1 1 11 2010
2 1 2 2010
5 2 4 2011
7 3 4 2007
8 3 9 2007
Upvotes: 3