Reputation: 131088
I have a pandas data frame. I want to group it by using one combination of columns and count distinct values of another combination of columns.
For example I have the following data frame:
   a   b    c     d      e
0  1  10  100  1000  10000
1  1  10  100  1000  20000
2  1  20  100  1000  20000
3  1  20  100  2000  20000
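For reference, the frame above can be reconstructed with (a minimal reproduction sketch):

```python
import pandas as pd

# Rebuild the example frame shown in the question
df = pd.DataFrame({
    'a': [1, 1, 1, 1],
    'b': [10, 10, 20, 20],
    'c': [100, 100, 100, 100],
    'd': [1000, 1000, 1000, 2000],
    'e': [10000, 20000, 20000, 20000],
})
print(df)
```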
I can group it by columns a and b and count distinct values in column d:
df.groupby(['a','b'])['d'].nunique().reset_index()
As a result I get:
   a   b  d
0  1  10  1
1  1  20  2
However, I would like to count distinct values in a combination of columns. For example, if I use c and d, then in the first group there is only one unique combination, (100, 1000), while in the second group there are two distinct combinations: (100, 1000) and (100, 2000).
The following naive "generalization" does not work:
df.groupby(['a','b'])[['c','d']].nunique().reset_index()
because nunique() is not applicable to data frames (and even where DataFrameGroupBy.nunique is available, it counts distinct values per column, not distinct row combinations).
Upvotes: 8
Views: 7791
Reputation:
You can extend your original concept...
df.groupby(['a', 'b', 'c']).d.nunique()
a  b   c
1  10  100    1
   20  100    2
Name: d, dtype: int64
You can drop the c level from the index and focus on just the information you want. (This happens to work here because c takes only one value within each (a, b) group; with several c values you would instead aggregate the per-c counts, e.g. with .groupby(level=['a', 'b']).sum().)
df.groupby(['a', 'b', 'c']).d.nunique().reset_index('c', drop=True)
a  b
1  10    1
   20    2
Name: d, dtype: int64
Upvotes: 1
Reputation: 862661
You can create a combination of values by converting both columns to string in a new column e and then use SeriesGroupBy.nunique (a separator guards against different pairs concatenating to the same string):
df['e'] = df.c.astype(str) + '_' + df.d.astype(str)
df = df.groupby(['a','b'])['e'].nunique().reset_index()
print (df)
   a   b  e
0  1  10  1
1  1  20  2
You can also work with a Series directly, without creating a new column:
df = (df.c.astype(str) + '_' + df.d.astype(str)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2
Another possible solution is to create tuples:
df = (df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2
Another NumPy solution, adapted from this answer:
import numpy as np

def f(x):
    a = x.values
    # view each row as a single void item so np.unique compares whole rows at once
    c = len(np.unique(np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))))
    return c

print (df.groupby(['a','b'])[['c','d']].apply(f))
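On NumPy 1.13 or later the void-view trick can be replaced by np.unique with axis=0, which deduplicates whole rows directly. A runnable sketch on the question's data (column e omitted for brevity):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [10, 10, 20, 20],
                   'c': [100, 100, 100, 100], 'd': [1000, 1000, 1000, 2000]})

def count_unique_rows(x):
    # axis=0 makes np.unique compare entire rows instead of flattened values
    return len(np.unique(x.values, axis=0))

result = df.groupby(['a', 'b'])[['c', 'd']].apply(count_unique_rows)
print(result)
```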
Timings:
#[1000000 rows x 5 columns]
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30, size=(N,5)))
df.columns = list('abcde')
print (df)
In [354]: %timeit (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))
1 loop, best of 3: 663 ms per loop
In [355]: %timeit (df.groupby(['a','b'])[['c','d']].apply(f))
1 loop, best of 3: 387 ms per loop
In [356]: %timeit (df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size())
1 loop, best of 3: 441 ms per loop
In [357]: %timeit ((df.c.astype(str)+df.d.astype(str)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 4.95 s per loop
In [358]: %timeit ((df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 17.6 s per loop
Upvotes: 11
Reputation: 294258
Don't stop the groupby at 'a', 'b'; include everything you're looking at.
df.groupby(['a', 'b', 'c', 'd']).size()
a  b   c    d
1  10  100  1000    2
   20  100  1000    1
            2000    1
dtype: int64
unstack to get a different view:
df.groupby(['a', 'b', 'c', 'd']).size().unstack(fill_value=0)
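For the question's frame this pivots the innermost level, d, into columns (a sketch reconstructing the example data, with column e omitted):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [10, 10, 20, 20],
                   'c': [100, 100, 100, 100], 'd': [1000, 1000, 1000, 2000]})

# Count rows per full combination, then unstack d into columns;
# missing combinations become 0 instead of NaN
view = df.groupby(['a', 'b', 'c', 'd']).size().unstack(fill_value=0)
print(view)
```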
to actually get at the information you're looking for
df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size()
a  b
1  10    1
   20    2
dtype: int64
This says:
within the group where a is 1 and b is 10, there is 1 unique combination of c and d;
within the group where a is 1 and b is 20, there are 2 unique combinations of c and d.
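An equivalent way to get the same counts (a sketch, not from the answer itself) is to drop duplicate rows first and then size the remaining groups:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [10, 10, 20, 20],
                   'c': [100, 100, 100, 100], 'd': [1000, 1000, 1000, 2000]})

# After deduplication each surviving row is one distinct (a, b, c, d) combination,
# so the group sizes count distinct (c, d) pairs per (a, b)
result = df.drop_duplicates(['a', 'b', 'c', 'd']).groupby(['a', 'b']).size()
print(result)
```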
Upvotes: 7
Reputation: 16251
If you do not want to concatenate columns, you can apply a function that counts the number of non-duplicated rows:
df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum())
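A runnable sketch of this on the question's data (column e omitted for brevity):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1], 'b': [10, 10, 20, 20],
                   'c': [100, 100, 100, 100], 'd': [1000, 1000, 1000, 2000]})

# duplicated() flags repeated (c, d) rows within each group;
# subtracting that count from the group length leaves the distinct count
counts = df.groupby(['a', 'b'])[['c', 'd']].apply(lambda g: len(g) - g.duplicated().sum())
print(counts)
```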
Upvotes: 3