Reputation: 52253
I want to count the number of distinct tuples within each group:
df = pd.DataFrame({'a': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'b': [1, 2, 1, 2, 1, 2, 1, 2],
'c': [1, 1, 2, 2, 2, 1, 2, 1]})
counts = count_distinct(df, by='a', columns=['b', 'c'])
assert counts == pd.Series([4, 2], index=['A', 'B'])
In other words, counts should report that for group 'A', there are four distinct tuples and for group 'B', there are two.
I tried using df.groupby('a')['b', 'c'].nunique()
, but nunique
works only with a single column.
I know I could count distinct tuples by df.groupby(['b', 'c']), but that means I have use a slow apply
with a pure python function (the number of groups of column 'a' is large).
I could convert the 'b' and 'c' columns into a single column of tuples, but that would be super slow since it will no longer use vectorized operations.
Upvotes: 1
Views: 659
Reputation: 214957
I think your logic is equivalent to count the size of data frames grouped by column a
after dropping the duplicated values of combined columns a
, b
and c
, since duplicated tuples within each group must also be duplicated records in the data frame assuming your data frame contains only columns a
, b
and c
and vice versa:
df.drop_duplicates().groupby('a').size()
# a
# A 4
# B 2
# dtype: int64
Upvotes: 3