GhostKU

Reputation: 2108

How to count unique records by two columns in pandas?

I have dataframe in pandas:

In [10]: df
Out[10]:
    col_a    col_b  col_c  col_d
0  France    Paris      3      4
1      UK    Londo      4      5
2      US  Chicago      5      6
3      UK  Bristol      3      3
4      US    Paris      8      9
5      US   London     44      4
6      US  Chicago     12      4

I need to count unique cities. I can count unique countries

In [11]: df['col_a'].nunique()
Out[11]: 3

and I can try to count unique cities

In [12]: df['col_b'].nunique()
Out[12]: 5

but that is wrong, because Paris in the US and Paris in France are different cities. So for now I'm doing it like this:

In [13]: df['col_a_b'] = df['col_a'] + ' - ' + df['col_b']

In [14]: df
Out[14]:
    col_a    col_b  col_c  col_d         col_a_b
0  France    Paris      3      4  France - Paris
1      UK    Londo      4      5      UK - Londo
2      US  Chicago      5      6    US - Chicago
3      UK  Bristol      3      3    UK - Bristol
4      US    Paris      8      9      US - Paris
5      US   London     44      4     US - London
6      US  Chicago     12      4    US - Chicago

In [15]: df['col_a_b'].nunique()
Out[15]: 6

Maybe there is a better way, without creating an additional column?

Upvotes: 39

Views: 59139

Answers (5)

Girish Bhasin

Reputation: 29

import pandas as pd

data = {'field1': [1, 4, 1, 68, 9], 'field2': [1, 1, 4, 5, 9]}
df = pd.DataFrame(data)

# Number of unique field2 values within each field1 group
results = df.groupby('field1')['field2'].nunique()
results

Output:

field1
1     2
4     1
9     1
68    1
Name: field2, dtype: int64

Upvotes: 1

Aks

Reputation: 952

Try this: I'm basically subtracting the number of duplicated rows from the total number of rows in df. This assumes we are grouping on all the category columns in the df.

df.shape[0] - df[['col_a','col_b']].duplicated().sum()

774 µs ± 603 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
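Applied to the question's dataframe (reconstructed here from the question), this also gives 6:

```python
import pandas as pd

# Reconstruct the question's dataframe
df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
    'col_c': [3, 4, 5, 3, 8, 44, 12],
    'col_d': [4, 5, 6, 3, 9, 4, 4],
})

# Total rows minus the number of duplicated (col_a, col_b) pairs;
# only the second 'US'/'Chicago' row is marked as a duplicate here
n_unique = df.shape[0] - df[['col_a', 'col_b']].duplicated().sum()
print(n_unique)  # 6
```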

Upvotes: 0

akuiper

Reputation: 214987

You can select col_a and col_b, drop the duplicates, then check the shape/len of the result data frame:

df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 6

len(df[['col_a', 'col_b']].drop_duplicates())
# 6

Because groupby ignores NaNs and may unnecessarily invoke a sorting process, choose the method accordingly if you have NaNs in the columns:

Consider a data frame as following:

import numpy as np

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan]
})

print(df)

#   col_a  col_b
#0    1.0    2.0
#1    2.0    2.0
#2    2.0    3.0
#3    NaN    NaN
#4    1.0    2.0
#5    4.0    NaN

Timing:

df = pd.concat([df] * 1000)

%timeit df.groupby(['col_a', 'col_b']).ngroups
# 1000 loops, best of 3: 625 µs per loop

%timeit len(df[['col_a', 'col_b']].drop_duplicates())
# 1000 loops, best of 3: 1.02 ms per loop

%timeit df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 1000 loops, best of 3: 1.01 ms per loop    

%timeit len(set(zip(df['col_a'],df['col_b'])))
# 10 loops, best of 3: 56 ms per loop

%timeit len(df.groupby(['col_a', 'col_b']))
# 1 loop, best of 3: 260 ms per loop

Result (computed on the concatenated 6000-row frame):

df.groupby(['col_a', 'col_b']).ngroups
# 3

len(df[['col_a', 'col_b']].drop_duplicates())
# 5

df[['col_a', 'col_b']].drop_duplicates().shape[0]
# 5

len(set(zip(df['col_a'],df['col_b'])))
# 2003

len(df.groupby(['col_a', 'col_b']))
# 2003

So the difference:

Option 1:

df.groupby(['col_a', 'col_b']).ngroups

is fast, and it excludes rows that contain NaNs.

Option 2 & 3:

len(df[['col_a', 'col_b']].drop_duplicates())
df[['col_a', 'col_b']].drop_duplicates().shape[0]

Reasonably fast; drop_duplicates treats NaN as equal to NaN, so the repeated (NaN, NaN) rows count as a single unique value.

Option 4 & 5:

len(set(zip(df['col_a'],df['col_b']))) 
len(df.groupby(['col_a', 'col_b'])) 

Slow, and they follow the logic that numpy.nan == numpy.nan is False, so distinct (nan, nan) rows are considered different values.
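As a side note, assuming a newer pandas (1.1 or later), groupby accepts a dropna parameter, so ngroups can be made to count NaN rows the same way drop_duplicates does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_a': [1, 2, 2, np.nan, 1, 4],
    'col_b': [2, 2, 3, np.nan, 2, np.nan],
})

# dropna=False keeps NaN keys, so the (NaN, NaN) rows form one group,
# matching the drop_duplicates-based counts above
print(df.groupby(['col_a', 'col_b'], dropna=False).ngroups)  # 5
```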

Upvotes: 23

BENY

Reputation: 323306

By using ngroups

df.groupby(['col_a', 'col_b']).ngroups
Out[101]: 6

Or using set

len(set(zip(df['col_a'],df['col_b'])))
Out[106]: 6
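Both one-liners can be checked on a copy of the question's data (reconstructed here):

```python
import pandas as pd

# The two key columns from the question
df = pd.DataFrame({
    'col_a': ['France', 'UK', 'US', 'UK', 'US', 'US', 'US'],
    'col_b': ['Paris', 'Londo', 'Chicago', 'Bristol', 'Paris', 'London', 'Chicago'],
})

# Count distinct (col_a, col_b) pairs two ways
n1 = df.groupby(['col_a', 'col_b']).ngroups
n2 = len(set(zip(df['col_a'], df['col_b'])))
print(n1, n2)  # 6 6
```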

Upvotes: 58

MaxU - stand with Ukraine

Reputation: 210872

In [105]: len(df.groupby(['col_a', 'col_b']))
Out[105]: 6

Upvotes: 5
