aahrens

Reputation: 5590

DataFrame groupBy when each group has a difference

I'm attempting to group by several columns in a DataFrame and keep only the rows where a specific column changed value.

Here is a sample of the dataset:

Id  Type    Size    StoredAt
107  1       20    2019-08-01
107  1       20    2019-08-02
107  1       20    2019-07-31
107  1       20    2019-07-30
107  3       20    2019-07-29
108  1       20    2019-08-01
108  1       20    2019-08-02
108  3       20    2019-07-31

What I'm attempting to do is group by so that I keep only two rows per Id, showing when the Type went from 1 to 3. Here is what I want the DataFrame to look like:

Id  Type    Size    StoredAt
107  1       20    2019-08-01
107  3       20    2019-07-29
108  1       20    2019-08-01
108  3       20    2019-07-31

All I have so far is:

dataFrame.groupby(['Id', 'Type', 'Size'])['Id', 'Type', 'Size', 'StoredAt']

Which isn't returning what I'm expecting.

Upvotes: 1

Views: 151

Answers (2)

BENY

Reputation: 323226

I think drop_duplicates works for you:

df.drop_duplicates(['Id','Type'])
Out[256]: 
    Id  Type  Size    StoredAt
0  107     1    20  2019-08-01
4  107     3    20  2019-07-29
5  108     1    20  2019-08-01
7  108     3    20  2019-07-31
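
The answer above can be reproduced with a minimal, runnable sketch (the frame is reconstructed from the sample in the question): `drop_duplicates` keeps the first row seen for each (Id, Type) pair, which matches the desired output for this row ordering.

```python
import pandas as pd

# Data reconstructed from the question's sample.
df = pd.DataFrame({
    'Id':       [107, 107, 107, 107, 107, 108, 108, 108],
    'Type':     [1, 1, 1, 1, 3, 1, 1, 3],
    'Size':     [20] * 8,
    'StoredAt': ['2019-08-01', '2019-08-02', '2019-07-31', '2019-07-30',
                 '2019-07-29', '2019-08-01', '2019-08-02', '2019-07-31'],
})

# Keep the first occurrence of each (Id, Type) combination.
result = df.drop_duplicates(['Id', 'Type'])
print(result)
```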

Upvotes: 1

the_martian

Reputation: 291

It looks like you just want to keep the first occurrence of each Type within each Id group. In that case you can use the following to get your desired output:

df.groupby(['Id', 'Type']).first().reset_index()

This assumes your DataFrame is sorted on the "StoredAt" column. If it's not, you will need to sort beforehand.
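
A sketch of the "sort first" case, with the data reconstructed from the question: after `sort_values('StoredAt')`, `groupby(...).first()` keeps the earliest date for each (Id, Type) pair rather than whichever row happens to appear first.

```python
import pandas as pd

# Data reconstructed from the question's sample.
df = pd.DataFrame({
    'Id':       [107, 107, 107, 107, 107, 108, 108, 108],
    'Type':     [1, 1, 1, 1, 3, 1, 1, 3],
    'Size':     [20] * 8,
    'StoredAt': ['2019-08-01', '2019-08-02', '2019-07-31', '2019-07-30',
                 '2019-07-29', '2019-08-01', '2019-08-02', '2019-07-31'],
})

# Sort by date first so .first() picks the earliest row per group.
result = (df.sort_values('StoredAt')
            .groupby(['Id', 'Type'])
            .first()
            .reset_index())
print(result)
```

Note that for Id 107, Type 1 this yields 2019-07-30 (the earliest date) rather than the 2019-08-01 row shown in the question's desired output, so whether you sort depends on which row you actually want to keep.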

An alternative would be to use drop_duplicates with only the subset of columns that should form unique combinations (Id and Type in this case). That would be:

df.drop_duplicates(subset=['Id', 'Type'])

Upvotes: 1
