Reputation: 5590
I'm attempting to groupby several columns in a DataFrame and only keep the rows where a specific column changed values. Here is a sample of the dataset:
Id Type Size StoredAt
107 1 20 2019-08-01
107 1 20 2019-08-02
107 1 20 2019-07-31
107 1 20 2019-07-30
107 3 20 2019-07-29
108 1 20 2019-08-01
108 1 20 2019-08-02
108 3 20 2019-07-31
What I'm attempting to do is group the data so I only have two rows per Id that show when the Type went from 1 to 3. Here is what I want the DataFrame to look like:
Id Type Size StoredAt
107 1 20 2019-08-01
107 3 20 2019-07-29
108 1 20 2019-08-01
108 3 20 2019-07-31
All I have so far is:
dataFrame.groupby(['Id', 'Type', 'Size'])['Id', 'Type', 'Size', 'StoredAt']
This isn't returning what I'm expecting.
Upvotes: 1
Views: 151
Reputation: 323226
I think drop_duplicates will work for you:
df.drop_duplicates(['Id','Type'])
Out[256]:
Id Type Size StoredAt
0 107 1 20 2019-08-01
4 107 3 20 2019-07-29
5 108 1 20 2019-08-01
7 108 3 20 2019-07-31
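For reference, a minimal self-contained sketch that rebuilds the sample frame from the question and reproduces this output (drop_duplicates keeps the first occurrence of each (Id, Type) pair by default, i.e. keep='first'):

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Id':       [107, 107, 107, 107, 107, 108, 108, 108],
    'Type':     [1, 1, 1, 1, 3, 1, 1, 3],
    'Size':     [20, 20, 20, 20, 20, 20, 20, 20],
    'StoredAt': ['2019-08-01', '2019-08-02', '2019-07-31', '2019-07-30',
                 '2019-07-29', '2019-08-01', '2019-08-02', '2019-07-31'],
})

# Keep the first row seen for each (Id, Type) combination.
print(df.drop_duplicates(['Id', 'Type']))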
Upvotes: 1
Reputation: 291
It looks like you're just keeping the first occurrence of each Type within a group. In that case you can use the following to get your desired output:
df.groupby(['Id', 'Type']).first().reset_index()
This assumes that your DataFrame is sorted on the "StoredAt" column. If it's not, you will need to sort it beforehand, as sketched below.
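A minimal sketch of the sort-first variant (assuming the StoredAt strings sort chronologically, which they do in the YYYY-MM-DD format shown):

df.sort_values('StoredAt').groupby(['Id', 'Type']).first().reset_index()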
An alternative would be to use drop_duplicates with only the subset of columns that should form unique combinations (Id and Type in this case):
df.drop_duplicates(subset=['Id', 'Type'])
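Note that drop_duplicates also keeps the first occurrence in each group by default (keep='first'), so the same ordering caveat applies here.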
Upvotes: 1