Reputation: 1111
Here is what I have:
import pandas as pd
df = pd.DataFrame()
df['date'] = ['2020-01-01', '2020-01-01','2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03']
df['value'] = ['A', 'A', 'A', 'A', 'B', 'A', 'C']
df
date value
0 2020-01-01 A
1 2020-01-01 A
2 2020-01-01 A
3 2020-01-02 A
4 2020-01-02 B
5 2020-01-03 A
6 2020-01-03 C
I want to count the unique values cumulatively over time, like this:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 3
I am NOT looking for this as an answer:
date value
0 2020-01-01 1
3 2020-01-02 2
5 2020-01-03 2
I need 2020-01-03 to be 3, because by that date three unique values (A, B, C) have appeared.
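For reference, I think that second (not-wanted) table is what a plain per-day count gives, something like:
# counts unique values within each day only, so 2020-01-03 comes out as 2 (A, C)
df.groupby('date')['value'].nunique()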
Upvotes: 1
Views: 224
Reputation: 153460
Let's use pd.crosstab instead:
(pd.crosstab(df['date'], df['value']) != 0).cummax().sum(axis=1)
Output:
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
dtype: int64
Details:
First, reshape the dataframe with pd.crosstab so that 'date' becomes the rows and each distinct value becomes a column. Then compare against zero to flag which values occur on each date, apply cummax down each column so a value stays flagged once it has been seen, and finally sum across the rows to get how many distinct values have been seen up to each date.
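To make the steps concrete, here is a sketch of the intermediates on the question's df (the names ct, seen and seen_so_far are mine):
ct = pd.crosstab(df['date'], df['value'])   # counts of each value per date
# value       A  B  C
# date
# 2020-01-01  3  0  0
# 2020-01-02  1  1  0
# 2020-01-03  1  0  1
seen = ct != 0               # which values occur on each date
seen_so_far = seen.cummax()  # once a value has been seen, its flag stays True
seen_so_far.sum(axis=1)      # distinct values seen up to each date -> 1, 2, 3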
Upvotes: 4
Reputation: 26676
I think you can np.cumsum a flag marking the first occurrence of each unique value, then .groupby the date (which in this case I have set as the index) and take either the maximum or the last value.
import numpy as np
np.cumsum(~df.set_index('date').duplicated('value')).groupby(level=0).max()
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
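To see what each piece does, here is a sketch of the intermediates on the question's df (first_seen and running are my names):
first_seen = ~df.set_index('date').duplicated('value')  # True on the first row where each value appears
# 2020-01-01     True   (first A)
# 2020-01-01    False
# 2020-01-01    False
# 2020-01-02    False
# 2020-01-02     True   (first B)
# 2020-01-03    False
# 2020-01-03     True   (first C)
running = np.cumsum(first_seen)   # running count of distinct values, row by row
running.groupby(level=0).max()    # keep the max (i.e. last) count within each date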
Upvotes: 1
Reputation: 323226
We can do agg with list, cumsum the lists, then map to set and len:
s = df.groupby('date').value.agg(list).cumsum().map(set).map(len)
s
date
2020-01-01 1
2020-01-02 2
2020-01-03 3
Name: value, dtype: int64
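For illustration, the intermediate Series on the question's df look like this (lists and cum are my names):
lists = df.groupby('date').value.agg(list)  # one list of values per date
# 2020-01-01          ['A', 'A', 'A']
# 2020-01-02                ['A', 'B']
# 2020-01-03                ['A', 'C']
cum = lists.cumsum()                        # lists concatenated cumulatively
# 2020-01-01                ['A', 'A', 'A']
# 2020-01-02      ['A', 'A', 'A', 'A', 'B']
# 2020-01-03 ['A', 'A', 'A', 'A', 'B', 'A', 'C']
cum.map(set).map(len)                       # distinct values seen so far -> 1, 2, 3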
Upvotes: 6