Giannis

Reputation: 83

How to count the unique duplicate values in each column

We have the following dataframe,

import pandas as pd

df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})
df

I want to create a new dataframe that shows, for every column name, the number of duplicated values.

e.g. 'B' has two values that are duplicated (9 and 3), so I want to print 2, etc.
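For the sample above, the desired output would presumably be one count per column, i.e.:

A    2
B    2
C    1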

Upvotes: 0

Views: 2609

Answers (4)

Trenton McKinney

Reputation: 62453

  • This can be done by getting the pandas.Series.value_counts for each column, and then taking the pandas.Series.sum of the counts that are greater than 1.
    • vc[vc.gt(1)] creates a pandas.Series containing, for each value in a column, only the counts that are greater than 1 (see the short sketch below).
  • The %%timeit comparison on 5 columns of 1M rows shows that .apply with vectorized methods, as well as the for-loop and the dict-comprehension, are faster than .apply with the built-in Python sum(...).
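A minimal sketch of the vc[vc.gt(1)] pattern, using column 'B' of the question's data:

vc = df['B'].value_counts()   # 9 and 3 appear twice; 6, 7, 2 and 5 appear once
vc[vc.gt(1)]                  # -> counts for 9 and 3 only
vc.gt(1).sum()                # -> 2, the number of duplicated values in 'B'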

.apply with .value_counts and .sum

  • col.value_counts().gt(1) creates a Boolean Series
    • True evaluates as 1 and False as 0, so .sum() produces a correct result.
dupe_count = df.agg(lambda col: col.value_counts().gt(1).sum())

A    2
B    2
C    1
dtype: int64

for-loop

  • It's not typically recommended to iterate through a dataframe, especially row-wise. However, we're iterating through the columns, and then applying a vectorized function, which is equivalent to what is happening with .apply.
def col_vc(df):
    dupe_count = dict()
    for col in df.columns:
        dupe_count[col] = df[col].value_counts().gt(1).sum()
    return dupe_count


col_vc(df)

[result]:
{'A': 2, 'B': 2, 'C': 1}
  • The equivalent one-liner dict-comprehension
dupe_count = {col: df[col].value_counts().gt(1).sum() for col in df.columns}

[result]:
{'A': 2, 'B': 2, 'C': 1}

# to a dataframe if desired
dupe_count = pd.DataFrame.from_dict(dupe_count, orient='index')

   0
A  2
B  2
C  1

%%timeit comparison

import pandas as pd
import numpy as np

# sample data 5 columns by 1M rows
np.random.seed(365)
rows = 1000000
data = {'a': np.random.randint(0, 10000, size=(rows)),
        'b': np.random.randint(15, 25000, size=(rows)),
        'c': np.random.randint(30, 40000, size=(rows)),
        'd': np.random.randint(450, 550000, size=(rows)),
        'e': np.random.randint(6000, 70000, size=(rows))}
df = pd.DataFrame(data)
  • .apply with .value_counts and .sum
%%timeit
df.agg(lambda x: x.value_counts().gt(1).sum())
[out]:
112 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • dict-comprehension
%%timeit
{col: df[col].value_counts().gt(1).sum() for col in df.columns}
[out]:
111 ms ± 983 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • for-loop
%%timeit
col_vc(df)
[out]:
115 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • .apply with built-in python sum(...)
%%timeit
df.agg(lambda x: sum(x.value_counts() > 1))
[out]:
194 ms ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 0

Алексей Р

Reputation: 7627

Option 1

If we need to count duplicate occurrences (every repeat of an earlier value):

import pandas as pd

df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})

df1 = df.apply(lambda x: sum(x.duplicated()))
print(df1)

Prints:

A    3
B    2
C    2
dtype: int64

Option 2

If we need to count the values that have duplicates (each duplicated value counted once):

df1 = df.agg(lambda x: sum(x.value_counts() > 1)) # or df1 = df.apply(lambda x: sum(x.value_counts() > 1))
print(df1)

Prints:

A    2
B    2
C    1
dtype: int64
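To make the difference concrete, here is a minimal sketch using column 'A' = [1, 2, 3, 3, 2, 4, 5, 3]: Option 1 counts every row that repeats an earlier value, while Option 2 counts distinct values that occur more than once.

s = pd.Series([1, 2, 3, 3, 2, 4, 5, 3])
s.duplicated().sum()          # 3 -> the second 2, the second 3 and the third 3 (Option 1)
(s.value_counts() > 1).sum()  # 2 -> only the values 2 and 3 are duplicated (Option 2)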

Option 2.1

Detailed: show which values are duplicated and how many times each occurs

df1 = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().items() if v > 1]))
print(df1)

Prints:

A    [val = 3, cnt = 3] [val = 2, cnt = 2]
B    [val = 9, cnt = 2] [val = 3, cnt = 2]
C                       [val = 4, cnt = 3]
dtype: object

Upvotes: 3

mozway

Reputation: 260975

You can use collections.Counter and itertools.takewhile:

from collections import Counter
from itertools import takewhile
df.apply(lambda c: len(list(takewhile(lambda x: x[1]>1, Counter(c).most_common()))))

output:

A    2
B    2
C    1

If you want the output as a dataframe, add .to_frame(name='n_duplicates').
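Spelled out in full:

df.apply(lambda c: len(list(takewhile(lambda x: x[1] > 1, Counter(c).most_common())))).to_frame(name='n_duplicates')

output: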

   n_duplicates
A             2
B             2
C             1

How it works

For each column, Counter gets the count of each element and most_common returns them ordered from most to least common.

takewhile iterates over this input and stops as soon as it reaches an element whose count is not above the threshold (here, a count of 1).

Finally, we get the length of this output, which corresponds to the number of duplicated groups.
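A step-by-step sketch on column 'B' of the question's data (reusing the imports above):

c = Counter(df['B'])              # Counter({9: 2, 3: 2, 6: 1, 7: 1, 2: 1, 5: 1})
pairs = c.most_common()           # [(9, 2), (3, 2), (6, 1), (7, 1), (2, 1), (5, 1)]
kept = list(takewhile(lambda x: x[1] > 1, pairs))  # [(9, 2), (3, 2)]
len(kept)                         # 2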

Upvotes: -1

I'mahdi

Reputation: 24049

If you want to count the duplicates of each element, you can use this:

import pandas as pd
from collections import Counter

df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})
def cnt(x):
    # keep only the values that occur more than once, with their counts
    return {k: v for k, v in x.items() if v > 1}

df.apply(lambda x: cnt(Counter(x)))

Output:

A    {2: 2, 3: 3}
B    {9: 2, 3: 2}
C          {4: 3}
dtype: object
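If you then want the per-column counts the question asked for, one option (an assumption, not part of the original answer) is to take the length of each dict:

df.apply(lambda x: cnt(Counter(x))).map(len)  # number of duplicated values per column

Output:

A    2
B    2
C    1
dtype: int64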

Upvotes: 0
