Reputation: 83
We have the following dataframe,
df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})
df
I want to create a new dataframe that shows, for every column name, the number of duplicated values.
e.g. 'B' has two values that are duplicated (9 and 3), so I want to print 2, etc.
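So for this sample the desired result would be, roughly (sketched here from the data above):
A    2
B    2
C    1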
Upvotes: 0
Views: 2609
Reputation: 62453
Use pandas.Series.value_counts for each column, and then get the pandas.Series.sum where the value count is greater than 1.
vc[vc.gt(1)] creates a pandas.Series with the counts, for each value in a column, that are greater than 1.
The %%timeit comparison for 5 columns of 1M rows shows that .apply with vectorized methods, as well as the for-loop and dict-comprehension, are faster than using .apply with the built-in Python sum(...).
.apply with .value_counts and .sum
col.value_counts().gt(1) creates a Boolean Series; True evaluates as 1 and False as 0, so .sum() produces a correct result.
dupe_count = df.agg(lambda col: col.value_counts().gt(1).sum())
A 2
B 2
C 1
dtype: int64
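For a single column, the intermediate steps look like this (a quick sketch using column 'B' of the sample):
vc = df['B'].value_counts()  # counts per value: 9 -> 2, 3 -> 2, the rest -> 1
vc.gt(1)                     # Boolean Series: True for values that appear more than once
vc.gt(1).sum()               # True counts as 1, so this returns 2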
for-loop (without .apply)
def col_vc(df):
    dupe_count = dict()
    for col in df.columns:
        dupe_count[col] = df[col].value_counts().gt(1).sum()
    return dupe_count
col_vc(df)
[result]:
{'A': 2, 'B': 2, 'C': 1}
dict-comprehension
dupe_count = {col: df[col].value_counts().gt(1).sum() for col in df.columns}
[result]:
{'A': 2, 'B': 2, 'C': 1}
# to a dataframe if desired
dupe_count = pd.DataFrame.from_dict(dupe_count, orient='index')
   0
A  2
B  2
C  1
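As an optional tweak, from_dict also accepts a columns argument when orient='index', which gives the column a name instead of the default 0:
counts = {col: df[col].value_counts().gt(1).sum() for col in df.columns}
dupe_count = pd.DataFrame.from_dict(counts, orient='index', columns=['dupe_count'])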
%%timeit comparison
import pandas as pd
import numpy as np
# sample data 5 columns by 1M rows
np.random.seed(365)
rows = 1000000
data = {'a': np.random.randint(0, 10000, size=rows),
        'b': np.random.randint(15, 25000, size=rows),
        'c': np.random.randint(30, 40000, size=rows),
        'd': np.random.randint(450, 550000, size=rows),
        'e': np.random.randint(6000, 70000, size=rows)}
df = pd.DataFrame(data)
.apply with .value_counts and .sum
%%timeit
df.agg(lambda x: x.value_counts().gt(1).sum())
[out]:
112 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
dict-comprehension
%%timeit
{col: df[col].value_counts().gt(1).sum() for col in df.columns}
[out]:
111 ms ± 983 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
for-loop
%%timeit
col_vc(df)
[out]:
115 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
.apply with sum()
%%timeit
df.agg(lambda x: sum(x.value_counts() > 1))
[out]:
194 ms ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 0
Reputation: 7627
If we need to calculate the number of duplicate values:
import pandas as pd
df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})
df1 = df.apply(lambda x: sum(x.duplicated()))
print(df1)
Prints:
A 3
B 2
C 2
dtype: int64
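The difference between the two counts is easiest to see on a single column; a small sketch for column 'A', where 3 appears three times and 2 twice:
df['A'].duplicated().sum()          # 3 entries are repeats of an earlier value
(df['A'].value_counts() > 1).sum()  # 2 distinct values (2 and 3) have duplicates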
If we need to calculate the number of values that have duplicates:
df1 = df.agg(lambda x: sum(x.value_counts() > 1)) # or df1 = df.apply(lambda x: sum(x.value_counts() > 1))
print(df1)
Prints:
A 2
B 2
C 1
dtype: int64
A detailed view:
df1 = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().items() if v > 1]))
print(df1)
Prints:
A [val = 3, cnt = 3] [val = 2, cnt = 2]
B [val = 9, cnt = 2] [val = 3, cnt = 2]
C [val = 4, cnt = 3]
dtype: object
Upvotes: 3
Reputation: 260975
You can use collections.Counter and itertools.takewhile:
from collections import Counter
from itertools import takewhile
df.apply(lambda c: len(list(takewhile(lambda x: x[1]>1, Counter(c).most_common()))))
output:
A 2
B 2
C 1
If you want the output as a dataframe, add .to_frame(name='n_duplicates'):
output:
n_duplicates
A 2
B 2
C 1
For each column, Counter gets the count of each element and most_common returns them with the most common first. takewhile iterates over this input and stops as soon as it reaches an element whose count is not above the threshold (here 1). Finally, we take the length of this output, which corresponds to the number of duplicated groups.
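Step by step for column 'B' (a small sketch of the intermediate values; tie order in most_common may vary):
Counter(df['B']).most_common()
# [(9, 2), (3, 2), (6, 1), (7, 1), (2, 1), (5, 1)]
list(takewhile(lambda x: x[1] > 1, Counter(df['B']).most_common()))
# [(9, 2), (3, 2)] -> length 2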
Upvotes: -1
Reputation: 24049
If you want to count duplicates by element, you can use this:
import pandas as pd
from collections import Counter
df = pd.DataFrame(data={'A': [1, 2, 3, 3, 2, 4, 5, 3],
                        'B': [9, 6, 7, 9, 2, 5, 3, 3],
                        'C': [4, 4, 4, 5, 9, 3, 2, 1]})
def cnt(x):
    return {k: v for k, v in x.items() if v > 1}
df.apply(lambda x: cnt(Counter(x)))
Output:
A {2: 2, 3: 3}
B {9: 2, 3: 2}
C {4: 3}
dtype: object
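To get the single count per column that the question asks for, the dicts' lengths can be taken (a possible follow-up sketch):
df.apply(lambda x: cnt(Counter(x))).apply(len)
# A    2
# B    2
# C    1
# dtype: int64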
Upvotes: 0