Jim O.

Reputation: 1111

Cumulative aggregate of unique string values

Here is what I have:

import pandas as pd
df = pd.DataFrame()
df['date'] = ['2020-01-01', '2020-01-01','2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03']
df['value'] = ['A', 'A', 'A', 'A', 'B', 'A', 'C']
df
           date value
0   2020-01-01      A
1   2020-01-01      A
2   2020-01-01      A
3   2020-01-02      A
4   2020-01-02      B
5   2020-01-03      A
6   2020-01-03      C

I want a cumulative count of the unique values seen over time, like this:

           date value
0   2020-01-01      1
3   2020-01-02      2
5   2020-01-03      3

I am NOT looking for this as an answer:

           date value
0   2020-01-01      1
3   2020-01-02      2
5   2020-01-03      2

I need the value for 2020-01-03 to be 3 because three unique values (A, B, C) have appeared up to and including that date.
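To make the difference concrete, here is a minimal sketch contrasting the two results. The names per_date, seen and running are only illustrative, and the plain-Python loop assumes the rows are already sorted by date.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01",
             "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"],
    "value": ["A", "A", "A", "A", "B", "A", "C"],
})

# NOT wanted: distinct values counted within each date in isolation.
per_date = df.groupby("date")["value"].nunique()

# Wanted: distinct values seen so far, carried across dates.
# Assumes the rows are sorted by date.
seen = set()
running = {}
for date, value in zip(df["date"], df["value"]):
    seen.add(value)
    running[date] = len(seen)  # the last row of each date overwrites earlier ones

print(per_date.tolist())  # [1, 2, 2]
print(running)            # {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
```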

Upvotes: 1

Views: 224

Answers (3)

Scott Boston

Reputation: 153460

Let's use pd.crosstab instead:

(pd.crosstab(df['date'], df['value']) != 0).cummax().sum(axis=1)

Output:

date
2020-01-01    1
2020-01-02    2
2020-01-03    3
dtype: int64

Details:

First, reshape the dataframe so that each 'date' is a row and each distinct value is a column; pd.crosstab fills the cells with occurrence counts. Comparing with 0 turns the counts into booleans that mark which values appear on each date. cummax then runs down each column, so a cell stays True from the first date on which that value was seen. Finally, sum(axis=1) adds up the True cells in each row, which gives the number of distinct values seen up to and including each date.
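The same chain can be split into named steps to inspect each intermediate result. This is only a sketch of the one-liner above; the variable names counts, present, seen_so_far and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01",
             "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"],
    "value": ["A", "A", "A", "A", "B", "A", "C"],
})

# Step 1: one row per date, one column per distinct value, cells hold counts.
counts = pd.crosstab(df["date"], df["value"])

# Step 2: True where a value occurs at least once on that date.
present = counts != 0

# Step 3: cumulative max down each column keeps a value "seen"
# from its first appearance onwards.
seen_so_far = present.cummax()

# Step 4: count the True cells in each row.
result = seen_so_far.sum(axis=1)
print(result.tolist())  # [1, 2, 3]
```

Because the crosstab index is sorted by date, this version does not depend on the row order of df.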

Upvotes: 4

wwnde

Reputation: 26676

I think you can take np.cumsum over a mask that flags the first occurrence of each value, then group by the date (which I have set as the index here) and take either the maximum or the last value of each group.

import numpy as np
# ~duplicated('value') is True only at the first occurrence of each value
np.cumsum(~df.set_index('date').duplicated('value')).groupby(level=0).max()

date
2020-01-01    1
2020-01-02    2
2020-01-03    3
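
Broken into steps, the idea looks roughly like this. It is a sketch that uses pandas only and groups by the date column rather than the index; is_first, running and result are illustrative names, and it assumes the rows are sorted by date.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01",
             "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"],
    "value": ["A", "A", "A", "A", "B", "A", "C"],
})

# True only at the first occurrence of each value (rows 0, 4 and 6 here).
is_first = ~df["value"].duplicated()

# Running count of first occurrences, row by row.
running = is_first.cumsum()

# The largest running count within each date is the distinct count so far.
# Assumes the rows are sorted by date.
result = running.groupby(df["date"]).max()
print(result.tolist())  # [1, 2, 3]
```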

Upvotes: 1

BENY

Reputation: 323226

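We can aggregate each date's values into a list, concatenate the lists cumulatively with cumsum, and then count the distinct elements of each running list: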

s = df.groupby('date').value.agg(list).cumsum().map(set).map(len)
s
date
2020-01-01    1
2020-01-02    2
2020-01-03    3
Name: value, dtype: int64
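
A variant of the same idea, in case the growing concatenated lists become a concern on larger data: accumulate set unions with itertools.accumulate instead of list concatenation. This is a swapped-in technique rather than the code above, and per_date_sets, running_sets and result are illustrative names.

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01",
             "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"],
    "value": ["A", "A", "A", "A", "B", "A", "C"],
})

# One set of distinct values per date (groupby sorts the dates).
per_date_sets = df.groupby("date")["value"].agg(set)

# Running union: each element is the set of values seen up to that date.
running_sets = list(accumulate(per_date_sets, set.union))

# The size of each running set is the cumulative distinct count.
result = pd.Series(
    [len(s) for s in running_sets],
    index=per_date_sets.index,
    name="value",
)
print(result.tolist())  # [1, 2, 3]
```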

Upvotes: 6
