max
max

Reputation: 52253

COUNT DISTINCT / nunique within groups

I want to count the number of distinct tuples within each group:

df = pd.DataFrame({'a': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
              'b': [1, 2, 1, 2, 1, 2, 1, 2],
              'c': [1, 1, 2, 2, 2, 1, 2, 1]})
counts = count_distinct(df, by='a', columns=['b', 'c'])
assert counts == pd.Series([4, 2], index=['A', 'B'])

In other words, counts should report that for group 'A', there are four distinct tuples and for group 'B', there are two.

I tried using df.groupby('a')['b', 'c'].nunique(), but nunique works only with a single column.

I know I could count distinct tuples by df.groupby(['b', 'c']), but that means I have use a slow apply with a pure python function (the number of groups of column 'a' is large).

I could convert the 'b' and 'c' columns into a single column of tuples, but that would be super slow since it will no longer use vectorized operations.

Upvotes: 1

Views: 659

Answers (1)

akuiper
akuiper

Reputation: 214957

I think your logic is equivalent to count the size of data frames grouped by column a after dropping the duplicated values of combined columns a, b and c, since duplicated tuples within each group must also be duplicated records in the data frame assuming your data frame contains only columns a, b and c and vice versa:

df.drop_duplicates().groupby('a').size()

# a
# A    4
# B    2
# dtype: int64

Upvotes: 3

Related Questions