Reputation: 5411
I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:
df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})
I get a df that looks like this:
   one  two
0    1    1
1    1    2
2    1    1
I imagine the first step is to find all the different unique rows, which I do by:
df.drop_duplicates()
This gives me the following df:
   one  two
0    1    1
1    1    2
Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each is in the initial df. My result would look something like this:
Row     Count
[1 1]   2
[1 2]   1
How should I go about doing this last step?
Edit:
Here's a larger example to make it more clear:
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})
gives me:
     one  three    two
0   True   True   True
1   True  False  False
2   True  False  False
3  False  False   True
I want a result that tells me:
Row                  Count
[True True True]     1
[True False False]   2
[False False True]   1
Upvotes: 126
Views: 327416
Reputation: 21
df.groupby(df.columns.tolist()).size().reset_index(name='count')
   one  two  count
0    1    1      2
1    1    2      1
Upvotes: 0
Reputation: 1104
It is as easy as:
df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})
vc = df.value_counts(sort=False)
rs = pd.DataFrame(vc.index.to_list(), columns=df.columns)
rs["#"] = vc.values
     one    two  three  #
0  False   True  False  1
1   True  False  False  2
2   True   True   True  1
But if you just want to report the counts, this piece of code is enough:
df.value_counts(sort=False)
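If you'd rather get the counts as a dataframe in one step, a compact sketch (assuming a pandas version with DataFrame.value_counts, i.e. 1.1.0 or newer) is to chain reset_index:
# reset_index turns the MultiIndexed counts Series back into a DataFrame,
# with the counts in a column named "#"
rs = df.value_counts(sort=False).reset_index(name='#')
print(rs)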
Upvotes: 0
Reputation: 131
If you just need a count of the unique and duplicated rows (entire row duplicated), this could work:
df.duplicated().value_counts()
Output:
False    11398
True       154
dtype: int64
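As a minimal sketch on the question's original example (assuming the default keep='first', so only the surplus copies are flagged as True):
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# False -> first occurrences, True -> surplus duplicate rows
print(df.duplicated().value_counts())
# False    2
# True     1
# dtype: int64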
Upvotes: 12
Reputation: 23237
If you find some counts missing, or get the error ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements, read on.
NaN entries: The accepted solution is great and has helped many members. In a recent task, I found it can be further fine-tuned to support complete counting of a dataframe with NaN entries. Pandas represents missing entries or null values as NaN. Let's see what the output is for this use case when our dataframe contains NaN entries:
  Col1  Col2 Col3 Col4
0  ABC   123  XYZ  NaN    # group #1 of 3
1  ABC   123  XYZ  NaN    # group #1 of 3
2  ABC   678  PQR  def    # group #2 of 1
3  MNO   890  EFG  abc    # group #3 of 4
4  MNO   890  EFG  abc    # group #3 of 4
5  CDE   234  567  xyz    # group #4 of 2
6  ABC   123  XYZ  NaN    # group #1 of 3
7  CDE   234  567  xyz    # group #4 of 2
8  MNO   890  EFG  abc    # group #3 of 4
9  MNO   890  EFG  abc    # group #3 of 4
Applying the code:
df.groupby(df.columns.tolist(),as_index=False).size()
gives:
  Col1  Col2 Col3 Col4  size
0  ABC   678  PQR  def     1
1  CDE   234  567  xyz     2
2  MNO   890  EFG  abc     4
Oh, how come the count of Group #1 with 3 duplicate rows is missing?!
On some Pandas versions, you may get an error instead: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements. The fix is to pass the parameter dropna=False to the .groupby() function, as follows:
df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()
gives:
  Col1  Col2 Col3 Col4  size
0  ABC   123  XYZ  NaN     3    # <=== count of rows with `NaN`
1  ABC   678  PQR  def     1
2  CDE   234  567  xyz     2
3  MNO   890  EFG  abc     4
The count of duplicate rows with NaN can be successfully output with dropna=False. This parameter has been supported since Pandas version 1.1.0.
Another way to count duplicate rows with NaN entries is as follows:
df.value_counts(dropna=False).reset_index(name='count')
gives:
  Col1  Col2 Col3 Col4  count
0  MNO   890  EFG  abc      4
1  ABC   123  XYZ  NaN      3
2  CDE   234  567  xyz      2
3  ABC   678  PQR  def      1
Here, we use the .value_counts() function, again with the parameter dropna=False. However, this parameter has only been supported since Pandas version 1.3.0. If your version is older than that, you'll need to use the .groupby() solution to get complete counts for rows with NaN entries.
You will see that the output is in a different order than the previous result. The counts are sorted in descending order. If you want an unsorted result, you can specify sort=False:
df.value_counts(dropna=False, sort=False).reset_index(name='count')
This gives the same result as the df.groupby(df.columns.tolist(), as_index=False, dropna=False).size() solution:
  Col1  Col2 Col3 Col4  count
0  ABC   123  XYZ  NaN      3
1  ABC   678  PQR  def      1
2  CDE   234  567  xyz      2
3  MNO   890  EFG  abc      4
Note that this .value_counts() solution supports dataframes both with and without NaN entries and can be used as a general solution.
In fact, in the underlying implementation, .value_counts() calls GroupBy.size to get the counts: counts = self.groupby(subset, dropna=dropna).grouper.size()
Hence, for this use case, .value_counts() and the .groupby() approach in the accepted answer are actually doing the same thing, and we can use .value_counts() to get the desired counts of duplicate rows equally well.
Using .value_counts() to count duplicate rows has the additional benefit of simpler syntax. You can simply use df.value_counts() or df.value_counts(dropna=False), depending on whether your dataframe contains NaN or not. Chain with .reset_index() if you want the result as a dataframe instead of a Series.
Upvotes: 6
Reputation: 1064
Specific to your question, as others mentioned, a fast and easy way would be:
df.groupby(df.columns.tolist(),as_index=False).size()
If you'd like to count duplicates in particular column(s):
len(df['one'])-len(df['one'].drop_duplicates())
If you want to count duplicates in the entire dataframe:
len(df)-len(df.drop_duplicates())
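As a quick check of this arithmetic on the question's original three-row example (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# 3 total rows minus 2 unique rows -> 1 surplus duplicate row
print(len(df) - len(df.drop_duplicates()))   # 1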
Or you can simply use DataFrame.duplicated(subset=None, keep='first'):
df.duplicated(subset='one', keep='first').sum()
where
subset : column label or sequence of labels (by default, use all of the columns)
keep : {‘first’, ‘last’, False}, default ‘first’
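A small sketch of how keep changes the count on the question's example data (keep=False flags every member of a duplicated group, not just the extra copies):
import pandas as pd

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

# Default: only the extra copies are flagged
print(df.duplicated(keep='first').sum())   # 1

# keep=False: every row that has a duplicate anywhere is flagged
print(df.duplicated(keep=False).sum())     # 2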
Upvotes: 61
Reputation: 17911
To count occurrences of each row in a DataFrame you can use the method value_counts
(available since Pandas 1.1.0):
df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 10, 20, 20, 30]})
df.value_counts().reset_index(name='counts').query('counts > 1')
Output:
   A   B  counts
0  1  10       2
1  2  20       2
Upvotes: 0
Reputation: 21
I ran into this problem today and wanted to include NaNs, so I replaced them temporarily with "" (empty string). Please comment if you do not understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, but not extensively) since pandas will infer the data type again after replacing "" with np.nan.
import numpy as np
import pandas as pd

# create test data
df = pd.DataFrame({'test': ['foo', 'bar', None, None, 'foo'],
                   'test2': ['bar', None, None, None, 'bar'],
                   'test3': [None, 'foo', 'bar', None, None]})
# fill null values with '' to not lose them during groupby
# groupby all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort DataFrame by "group_count" descending
df = (df.fillna('')
        .groupby(df.columns.tolist()).apply(len)
        .rename('group_count')
        .reset_index()
        .replace('', np.nan)
        .sort_values(by=['group_count'], ascending=False))
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1
Upvotes: 2
Reputation: 585
I use:
used_features = [
    "one",
    "two",
    "three"
]
df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()
which gives the count of duplicated rows; you can then analyse them via the new column. I didn't see such a solution here.
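As a follow-up sketch (assuming the boolean dataframe from the question), the new column also lets you look at which rows were flagged:
# Select only the rows marked as duplicates of an earlier row
duplicated_rows = df[df['is_duplicated']]
print(duplicated_rows)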
Upvotes: 13
Reputation: 511
df.groupby(df.columns.tolist()).size().reset_index().\
rename(columns={0:'records'})
   one  two  records
0    1    1        2
1    1    2        1
Upvotes: 51
Reputation: 452
None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:
# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})
# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
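For a sanity check, a shorter equivalent sketch (assuming the default keep='first', which flags exactly the surplus copies, and no NaN rows) is:
# Number of rows that are duplicates of an earlier row
print(df.duplicated().sum())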
Upvotes: 8
Reputation: 18983
df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values,
                   index=df['str_list'].value_counts().index,
                   columns=['Count'])
Produces:
>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1
If the index values must be a list, you could take the above code a step further with:
df1.index = df1.index.str.split()
Produces:
           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1
Upvotes: 4
Reputation: 394469
You can groupby on all the columns and call size; the index indicates the duplicate values:
In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()
Out[28]:
one    three  two
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
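If you'd rather have a regular dataframe with a named count column (as several other answers here do), one sketch is to chain reset_index onto the same groupby:
# Same counts, but as a DataFrame with an explicit 'count' column
df.groupby(df.columns.tolist()).size().reset_index(name='count')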
Upvotes: 130