What is the most efficient way to get the count of distinct values in a pandas dataframe?

I have a dataframe as shown below.

    0   1   2
0   A   B   C
1   B   C   B
2   B   D   E
3   C   E   E
4   B   F   A

I need the count of unique values across the entire dataframe, not the per-column unique counts. In the dataframe above, the unique values are A, B, C, D, E, F, so the result I need is 6.

I'm currently doing this with the pandas squeeze, ravel and nunique functions, which together convert the entire dataframe into a single series:

pd.Series(df.squeeze().values.ravel()).nunique(dropna=True)

Please let me know if there is a better way to achieve this.

Upvotes: 4

Answers (3)

MrNobody33

Reputation: 6483

You can use set, len and flatten too:

len(set(df.values.flatten()))

Out: 6
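For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the dataframe from the question and applies the same set-based idea. The variable names (`flat`, `distinct`, `n_distinct`) are only illustrative, and `to_numpy().ravel()` is swapped in for `values.flatten()` because `ravel` can avoid a copy when the array layout allows it.

```python
import pandas as pd

# Rebuild the 5x3 dataframe from the question (columns 0, 1, 2).
df = pd.DataFrame([list("ABC"), list("BCB"), list("BDE"), list("CEE"), list("BFA")])

# Flatten every cell into one 1-D array, then let a set drop the duplicates.
flat = df.to_numpy().ravel()
distinct = set(flat)
n_distinct = len(distinct)

print(n_distinct)  # 6 for the dataframe in the question
```

The set is built by hashing each value, so no sorting is involved.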

Timings, using a dummy dataframe with 1,000,000 rows, 2 columns and 6 unique values:

#dummy data
import numpy as np
import pandas as pd

labels = ['aa','bbbb','c','ddddd','EeeeE','xxx']
df = pd.DataFrame({'Day': np.random.choice(labels, 10**6),
                   'Heloo': np.random.choice(labels, 10**6)})


print(df.shape)
(1000000, 2)


%timeit len(set(df.values.flatten()))

>>>89.5 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


%timeit np.unique(df.values).shape[0]

>>>1.61 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%timeit len(np.unique(df))

>>>1.85 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
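Outside IPython, a comparison like the one above can be approximated with the standard `timeit` module. The sketch below is only an assumption-laden harness: it uses a smaller frame (100,000 rows) so it finishes quickly, the `candidates` dict is a hypothetical helper, and `pd.unique` over the raveled values is an extra candidate that is not part of the measurements above. Absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller dummy frame than in the timings above, so the script runs fast.
rng = np.random.default_rng(0)
labels = ['aa', 'bbbb', 'c', 'ddddd', 'EeeeE', 'xxx']
n_rows = 10**5
df = pd.DataFrame({'Day': rng.choice(labels, n_rows),
                   'Heloo': rng.choice(labels, n_rows)})

# Candidate expressions; the last one is an extra, not from the timings above.
candidates = {
    'set + flatten': lambda: len(set(df.values.flatten())),
    'np.unique': lambda: len(np.unique(df.values)),
    'pd.unique + ravel': lambda: len(pd.unique(df.values.ravel())),
}

counts = {}
for name, fn in candidates.items():
    counts[name] = fn()                       # result of each approach
    seconds = timeit.timeit(fn, number=3) / 3  # average over 3 calls
    print(f"{name:>18}: count={counts[name]}, {seconds * 1000:.1f} ms per call")
```

All three should report the same count. The likely reason for the gap in the timings is that `np.unique` sorts the values, while `set` and `pd.unique` rely on hashing.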

Upvotes: 1

jezrael

Reputation: 863791

Use numpy.unique and take the length of the returned array of unique values:

out = len(np.unique(df))
6
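If it helps to see which values are being counted, a small sketch using the question's dataframe follows. It uses `np.unique` with `return_counts=True`, which also reports how often each value occurs; the names `values` and `counts` are just illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the 5x3 dataframe from the question.
df = pd.DataFrame([list("ABC"), list("BCB"), list("BDE"), list("CEE"), list("BFA")])

# np.unique flattens the 2-D input and returns the sorted unique values;
# return_counts=True additionally returns the number of occurrences of each.
values, counts = np.unique(df.to_numpy(), return_counts=True)

print(len(values))  # number of distinct values in the whole dataframe
for value, count in zip(values, counts):
    print(value, count)
```

Because the result is sorted, this only works when all values are mutually comparable.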

Upvotes: 4

Rahul Vishwakarma

Reputation: 1456

Use NumPy for this:

import numpy as np
print(np.unique(df.values).shape[0])

Upvotes: 4
