Dance Party2
Dance Party2

Reputation: 7536

Pandas distinct count column

Inspired by this post, I would like to get a distinct count of a value in a data frame per a grouping and create a column with the distinct count values in the data frame. Like this:

Original data frame:

import pandas as pd
df = pd.DataFrame(
    {'A' : ['foo', 'foo', 'foo', 'foo',
            'bar', 'bar', 'bar', 'bar'],
     'B' : ['foo', 'fo', 'foo', 'foo',
        'bar', 'bar', 'ba', 'ba'],
     'C' : [2, 4, 4, 2, 5, 4, 3, 2]})
df

    A       B       C
0   foo     foo     2
1   foo     fo      4
2   foo     foo     4
3   foo     foo     2
4   bar     bar     5
5   bar     bar     4
6   bar     ba      3
7   bar     ba      2

Method from linked post applied:

df=df.groupby(['A','B'])['C'].apply(lambda x: len(x.unique()))
df

Result per linked post method:

A    B  
bar  ba     2
     bar    2
foo  fo     1
     foo    2
Name: C, dtype: int64

Desired result:

    A       B       C   Distinct Count of C per A and B
0   foo     foo     2             2 
1   foo     fo      4             1 
2   foo     foo     4             2 
3   foo     foo     2             2
4   bar     bar     5             2 
5   bar     bar     4             2
6   bar     ba      3             2
7   bar     ba      2             2 

Looking at the first row, the combination of "foo" in "A" and "foo" in "B" has 2 unique values associated with it (2 and 4), resulting in a 2 in each row for that combination of values for columns A and B.

Thank in advance!

Upvotes: 2

Views: 615

Answers (1)

Anton Protopopov
Anton Protopopov

Reputation: 31682

Use transform instead of apply because it return column with the same size as original, I couldn't find documentation on the original pandas site for that, but from help:

transform(func, *args, **kwargs) method of pandas.core.groupby.SeriesGroupBy instance
Call function producing a like-indexed Series on each group and return a Series with the transformed values

df['Distinct Count of C per A and B'] = df.groupby(['A','B'])['C'].transform(lambda x: len(x.unique()))

In [1495]: df
Out[1495]: 
     A    B  C  Distinct Count of C per A and B
0  foo  foo  2                                2
1  foo   fo  4                                1
2  foo  foo  4                                2
3  foo  foo  2                                2
4  bar  bar  5                                2
5  bar  bar  4                                2
6  bar   ba  3                                2
7  bar   ba  2                                2

Upvotes: 1

Related Questions