Reputation: 33
I have a df like so:
import pandas as pd

df = pd.DataFrame({
    'date': ['01/01/2020', '01/01/2020', '01/01/2020', '02/01/2020', '02/01/2020', '03/01/2020', '03/01/2020'],
    'id': [101, 102, 103, 101, 104, 105, 106]
})
Output:
date id
0 01/01/2020 101
1 01/01/2020 102
2 01/01/2020 103
3 02/01/2020 101
4 02/01/2020 104
5 03/01/2020 105
6 03/01/2020 106
I require a cumulative count of the distinct values like so:
date id
0 01/01/2020 3
1 02/01/2020 4
2 03/01/2020 6
I have tried things like df.groupby(['date']).nunique(), but that's not right: it gives the unique count within each date, not the rolling count of distinct values that I require.
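For context, here is what that attempt produces on the sample data (distinct ids within each date only, nothing cumulative):
print(df.groupby(['date'])['id'].nunique())
# date
# 01/01/2020    3
# 02/01/2020    2
# 03/01/2020    2
# Name: id, dtype: int64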
Upvotes: 3
Views: 1615
Reputation: 862406
I believe it is necessary to first remove duplicates per id with DataFrame.drop_duplicates, then get counts per date with GroupBy.size, and finally add a cumulative sum with Series.cumsum:
df = df.drop_duplicates('id').groupby('date').size().cumsum().reset_index(name='id')
print(df)
date id
0 01/01/2020 3
1 02/01/2020 4
2 03/01/2020 6
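For clarity, here is the same pipeline broken into steps, run against the original df from the question (the one-liner above reassigns df); the intermediate results for this sample are noted in comments:
deduped = df.drop_duplicates('id')         # keep only the first occurrence of each id
per_date = deduped.groupby('date').size()  # 01/01/2020: 3, 02/01/2020: 1, 03/01/2020: 2
print(per_date.cumsum())                   # 3, 4, 6: the running total of distinct ids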
Upvotes: 4
Reputation: 30920
Or we can use DataFrame.duplicated:
(~df.duplicated('id')).groupby(df['date']).sum().cumsum().rename('id').reset_index()
date id
0 01/01/2020 3.0
1 02/01/2020 4.0
2 03/01/2020 6.0
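Note the float dtype in the result, which comes from summing the boolean mask. If integers are preferred, a minor variation is to cast after summing (a sketch; the exact dtype you get can depend on the pandas version):
print((~df.duplicated('id')).groupby(df['date']).sum().cumsum().astype(int).rename('id').reset_index())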
Upvotes: 3