savi
savi

Reputation: 323

multiple groupby and get unique count

I have the following df1:

id period color size rate
1    01    red   12   30
1    02    red   12   30
2    01    blue  12   35
3    03    blue  12   35
4    01    blue  12   35
4    02    blue  12   35
5    01    pink  10   40
6    01    pink  10   40

I need to create an index column that is basically a group of color-size-rate and count the number of unique id(s) that have this combination. My output df after transformation should look like this df1:

id period color size rate      index     count
1    01    red   12   30    red-12-30     1
1    02    red   12   30    red-12-30     1
2    01    blue  12   35    blue-12-35    3
2    03    blue  12   35    blue-12-35    3
4    01    blue  12   35    blue-12-35    3
4    02    blue  12   35    blue-12-35    3
5    01    pink  10   40    pink-10-40    2
6    01    pink  10   40    pink-10-40    2

I am able to get the count but it is not counting "unique" ids but instead , just the number of occurrences.

1    01    red   12   30    red-12-30     2
1    02    red   12   30    red-12-30     2
2    01    blue  12   35    blue-12-35    4
2    03    blue  12   35    blue-12-35    4
4    01    blue  12   35    blue-12-35    4
4    02    blue  12   35    blue-12-35    4

This is wrong as it is not actually grouping by id to count unique ones.

Appreciate any pointers in this direction.

Adding an edit here as my requirement changed: The count need to also be grouped by 'period' ie my final df should be:

index       period   count
red-12-30    01        1
red-12-30    02        1
blue-12-35   01        2
blue-12-35   03        1
blue-12-35   02        1
pink-10-40   01        2

Solution: from @anky: When I try adding another groupby['period'], I am getting dimension mismatch error.

Thank you in advance.

Upvotes: 1

Views: 59

Answers (1)

anky
anky

Reputation: 75080

You can try aggregating a join for creating the index column, then group on the same and get nunique using groupby+transform

idx = df[['color','size','rate']].astype(str).agg('-'.join,1)
out = df.assign(index=idx,count=df.groupby(idx)['id'].transform('nunique'))

print(out)

   id  period color  size  rate       index  count
0   1       1   red    12    30   red-12-30      1
1   1       2   red    12    30   red-12-30      1
2   2       1  blue    12    35  blue-12-35      3
3   3       3  blue    12    35  blue-12-35      3
4   4       1  blue    12    35  blue-12-35      3
5   4       2  blue    12    35  blue-12-35      3
6   5       1  pink    10    40  pink-10-40      2
7   6       1  pink    10    40  pink-10-40      2

Upvotes: 2

Related Questions