Reputation: 696
I have a DataFrame with customer_id and partition_date columns, and for each customer_id I would like to find the first date on which that customer appears in the dataset. This is what I've tried so far, but I get the error below:
df['first_seen_date'] = df.sort_values('partition_date').groupby('customer_id').first()
error: Wrong number of items passed 3, placement implies 1
df
customer_id partition_date
24242 01.01.2020
24242 02.01.2020
24242 04.01.2020
35439 06.01.2020
35439 05.01.2020
35439 07.01.2020
desired output df
customer_id first_seen_date
24242 01.01.2020
35439 05.01.2020
Upvotes: 1
Views: 32
Reputation: 862511
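You are close. The result of groupby(...).first() is a whole DataFrame with one row per customer, not a single column aligned with df, so it cannot be assigned to df['first_seen_date']. Assign it to a new DataFrame instead: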
df1 = df.sort_values('partition_date').groupby('customer_id', as_index=False).first()
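A minimal runnable sketch of this approach on the sample data from the question. Two things here are assumptions that are not in the original answer: the dates are taken to be day.month.year strings and are parsed with pd.to_datetime so that sorting is chronological rather than alphabetical, and the result column is renamed to first_seen_date to match the desired output.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "customer_id": [24242, 24242, 24242, 35439, 35439, 35439],
    "partition_date": ["01.01.2020", "02.01.2020", "04.01.2020",
                       "06.01.2020", "05.01.2020", "07.01.2020"],
})

# Assumption: dates are day.month.year strings; parse them so that
# sort_values orders them chronologically instead of as plain text
df["partition_date"] = pd.to_datetime(df["partition_date"], format="%d.%m.%Y")

df1 = (
    df.sort_values("partition_date")
      .groupby("customer_id", as_index=False)
      .first()
      # Rename added here only to match the desired output in the question
      .rename(columns={"partition_date": "first_seen_date"})
)
print(df1)
```

Without the date parsing, string sorting would put "01.02.2020" before "02.01.2020", so the parse step matters as soon as the data spans more than one month.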
Or use DataFrame.drop_duplicates, which by default keeps the first row for each customer_id after sorting:
df1 = df.sort_values('partition_date').drop_duplicates('customer_id')
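The drop_duplicates variant can be sketched the same way. The second half of the block is not part of the answer above: it is a different technique, groupby(...).transform('min'), shown in case the goal is to keep first_seen_date as a column on the original df, as the attempt in the question suggests. The date parsing is again an assumption about the date format.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "customer_id": [24242, 24242, 24242, 35439, 35439, 35439],
    "partition_date": ["01.01.2020", "02.01.2020", "04.01.2020",
                       "06.01.2020", "05.01.2020", "07.01.2020"],
})

# Assumption: dates are day.month.year strings
df["partition_date"] = pd.to_datetime(df["partition_date"], format="%d.%m.%Y")

# One row per customer: after sorting, the first row kept is the earliest date
df1 = df.sort_values("partition_date").drop_duplicates("customer_id")

# Alternative (not from the answer): transform returns a Series with the same
# length and index as df, so it can be assigned directly as a new column
df["first_seen_date"] = (
    df.groupby("customer_id")["partition_date"].transform("min")
)
print(df1)
print(df)
```

Note that drop_duplicates keeps the original row index labels, so add .reset_index(drop=True) if a clean 0..n-1 index is needed.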
Upvotes: 3