Reputation: 71
I wonder how to count accumulative unique values by groups in python?
Below is the dataframe example:
Group | Year | Type |
---|---|---|
A | 1998 | red |
A | 1998 | blue |
A | 2002 | red |
A | 2005 | blue |
A | 2008 | blue |
A | 2008 | yello |
B | 1998 | red |
B | 2001 | red |
B | 2003 | red |
C | 1996 | red |
C | 2002 | orange |
C | 2002 | red |
C | 2012 | blue |
C | 2012 | yello |
I need to create a new column by Column "Group". The value of this new column should be the accumulative unique values of Column "Type", accumulating by Column "Year".
Below is the dataframe I want. For example: (1)For Group A and in year 1998, I want to count the unique value of Type in year 1998, and there are two unique values of Type: red and blue. (2)For Group A and in year 2002, I want to count the unique value of Type in year 1998 and 2002, and there are also two unique values of Type: red and blue. (3)For Group A and in year 2008, I want to count the unique value of Type in year 1998, 2002, 2005, and 2008, and there are three unique values of Type: red, blue, and yellow.
Group | Year | Type | Want |
---|---|---|---|
A | 1998 | red | 2 |
A | 1998 | blue | 2 |
A | 2002 | red | 2 |
A | 2005 | blue | 2 |
A | 2008 | blue | 3 |
A | 2008 | yello | 3 |
B | 1998 | red | 1 |
B | 2001 | red | 1 |
B | 2003 | red | 1 |
C | 1996 | red | 1 |
C | 2002 | orange | 2 |
C | 2002 | red | 2 |
C | 2012 | blue | 4 |
C | 2012 | yello | 4 |
One more thing about this dataframe: not all groups have values in the same years. For example, group A has two values in year 1998 and 2008, one value in year 2002 and 2005. Group B has values in year 1998, 2001, and 2003.
I wonder how to address this problem. Your great help means a lot to me. Thanks!
Upvotes: 2
Views: 73
Reputation: 3883
For each Group
:
Append a new column Want
that has the values like you want:
def f(df):
want = df.groupby('Year')['Type'].agg(list).cumsum().apply(set).apply(len)
want.name = 'Want'
return df.merge(want, on='Year')
df.groupby('Group', group_keys=False).apply(f).reset_index(drop=True)
Result:
Group Year Type Want
0 A 1998 red 2
1 A 1998 blue 2
2 A 2002 red 2
3 A 2005 blue 2
4 A 2008 blue 3
5 A 2008 yello 3
6 B 1998 red 1
7 B 2001 red 1
8 B 2003 red 1
9 C 1996 red 1
10 C 2002 orange 2
11 C 2002 red 2
12 C 2012 blue 4
13 C 2012 yello 4
Notes:
I think the use of
.merge
here is efficient.You can also use 1
.apply
insidef
instead of 2 chained ones to improve efficiency:.apply(lambda x: len(set(x)))
Upvotes: 1