How to count cumulative unique values by group?

Question

How can I count cumulative unique values by group in Python?

Below is an example dataframe:

Group	Year	Type
A	1998	red
A	1998	blue
A	2002	red
A	2005	blue
A	2008	blue
A	2008	yello
B	1998	red
B	2001	red
B	2003	red
C	1996	red
C	2002	orange
C	2002	red
C	2012	blue
C	2012	yello

I need to create a new column, computed separately for each value of the "Group" column. Its value should be the cumulative number of unique values in the "Type" column, accumulated over the "Year" column.

Below is the dataframe I want. For example:

(1) For Group A in year 1998, I want to count the unique values of Type in 1998. There are two: red and blue.
(2) For Group A in year 2002, I want to count the unique values of Type in 1998 and 2002. There are still two: red and blue.
(3) For Group A in year 2008, I want to count the unique values of Type in 1998, 2002, 2005, and 2008. There are three: red, blue, and yellow.

Group	Year	Type	Want
A	1998	red	2
A	1998	blue	2
A	2002	red	2
A	2005	blue	2
A	2008	blue	3
A	2008	yello	3
B	1998	red	1
B	2001	red	1
B	2003	red	1
C	1996	red	1
C	2002	orange	2
C	2002	red	2
C	2012	blue	4
C	2012	yello	4

One more thing about this dataframe: not all groups have values in the same years. For example, group A has two values in each of 1998 and 2008 and one value in each of 2002 and 2005, while group B has values in 1998, 2001, and 2003.

How can I solve this problem? Any help would mean a lot to me. Thanks!

Vladimir Fokow · Accepted Answer

For each Group, append a new column Want that holds the cumulative count of unique Type values up to each Year:

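def f(df):
    # List the types seen in each year, concatenate those lists cumulatively,
    # then count the distinct types accumulated up to each year.
    want = df.groupby('Year')['Type'].agg(list).cumsum().apply(set).apply(len)
    want.name = 'Want'
    # Map the per-year counts back onto the rows of this group.
    return df.merge(want, on='Year')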

df.groupby('Group', group_keys=False).apply(f).reset_index(drop=True)

Result:

   Group  Year    Type  Want
0      A  1998     red     2
1      A  1998    blue     2
2      A  2002     red     2
3      A  2005    blue     2
4      A  2008    blue     3
5      A  2008   yello     3
6      B  1998     red     1
7      B  2001     red     1
8      B  2003     red     1
9      C  1996     red     1
10     C  2002  orange     2
11     C  2002     red     2
12     C  2012    blue     4
13     C  2012   yello     4
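
To reproduce this result from scratch, the sample data has to be built first. Here is a minimal sketch, assuming the three columns hold plain strings and integers as in the question. The explicit loop over groups is a substitution for the groupby(...).apply(f) call above, because how apply treats the grouping column can depend on the pandas version:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Group": list("AAAAAABBBCCCCC"),
    "Year": [1998, 1998, 2002, 2005, 2008, 2008,
             1998, 2001, 2003,
             1996, 2002, 2002, 2012, 2012],
    "Type": ["red", "blue", "red", "blue", "blue", "yello",
             "red", "red", "red",
             "red", "orange", "red", "blue", "yello"],
})


def f(df):
    # Same function as in the answer: cumulative distinct-type count per year.
    want = df.groupby("Year")["Type"].agg(list).cumsum().apply(set).apply(len)
    want.name = "Want"
    return df.merge(want, on="Year")


# Apply f to each group separately and stack the pieces back together.
result = pd.concat([f(g) for _, g in df.groupby("Group")], ignore_index=True)
print(result)
```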

Notes:

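I think .merge is an efficient choice here: it maps the per-year counts back onto the rows of each group in one step, without a Python-level loop over rows.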

You can also use a single .apply inside f instead of the two chained ones, which may improve efficiency: .apply(lambda x: len(set(x)))
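
If building and concatenating lists becomes a bottleneck, a different technique (not the one used in this answer) is to flag the first appearance of each (Group, Type) pair and take a running sum of those flags within each group. Below is a rough sketch, assuming the same column names as in the question. The names is_new, new_per_year, and running are only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Group": list("AAAAAABBBCCCCC"),
    "Year": [1998, 1998, 2002, 2005, 2008, 2008,
             1998, 2001, 2003,
             1996, 2002, 2002, 2012, 2012],
    "Type": ["red", "blue", "red", "blue", "blue", "yello",
             "red", "red", "red",
             "red", "orange", "red", "blue", "yello"],
})

# Sort so that "first occurrence" means "earliest year" within each group.
ordered = df.sort_values(["Group", "Year"])

# True on the first row where a given (Group, Type) pair appears.
is_new = ~ordered.duplicated(["Group", "Type"])

# Number of newly seen types per (Group, Year) ...
new_per_year = is_new.groupby([ordered["Group"], ordered["Year"]]).sum()

# ... turned into a running total within each group.
running = new_per_year.groupby(level="Group").cumsum()

# Attach the per-(Group, Year) totals to the original rows.
result = df.merge(running.rename("Want").reset_index(),
                  on=["Group", "Year"], how="left")
print(result)
```

This sketch never materializes the accumulated lists, so the work per group does not grow with the number of years already seen. Whether that matters depends on the size of the data.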
