jss367

Reputation: 5411

How to count duplicate rows in pandas dataframe?

I am trying to count the duplicates of each type of row in my dataframe. For example, say that I have a dataframe in pandas as follows:

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

I get a df that looks like this:

    one two
0   1   1
1   1   2
2   1   1

I imagine the first step is to find all the different unique rows, which I do by:

df.drop_duplicates()

This gives me the following df:

    one two
0   1   1
1   1   2

Now I want to take each row from the above df ([1 1] and [1 2]) and get a count of how many times each is in the initial df. My result would look something like this:

Row     Count
[1 1]     2
[1 2]     1

How should I go about doing this last step?

Edit:

Here's a larger example to make it more clear:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})

gives me:

    one three   two
0   True    True    True
1   True    False   False
2   True    False   False
3   False   False   True

I want a result that tells me:

       Row           Count
[True True True]       1
[True False False]     2
[False False True]     1

Upvotes: 126

Views: 327416

Answers (12)

Roman Skrypin

Reputation: 21

df.groupby(df.columns.tolist()).size().reset_index(name='count')

which gives:

   one  two  count
0    1    1      2
1    1    2      1

Upvotes: 0

Alireza Ghaffari

Reputation: 1104

It is as easy as:

df = pd.DataFrame({'one': pd.Series([True, True, True, False]),
                   'two': pd.Series([True, False, False, True]),
                   'three': pd.Series([True, False, False, False])})


rs = pd.DataFrame(df.value_counts(sort=False).index.to_list(), columns=df.columns)
rs["#"] = df.value_counts(sort=False).values

which gives:

    one     two     three   #
0   False   True    False   1
1   True    False   False   2
2   True    True    True    1

But if you just want to display the counts, this piece of code is enough:

df.value_counts(sort=False)
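
For the example dataframe above, that line would print something like this (output sketch; exact formatting varies by pandas version):

one    two    three
False  True   False    1
True   False  False    2
       True   True     1
dtype: int64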

Upvotes: 0

Ammad

Reputation: 131

If you just need the count of unique and duplicate rows (the entire row duplicated), this could work:

df.duplicated().value_counts()

which outputs:

False    11398
True       154
dtype: int64
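
For example, on the boolean dataframe from the question's edit, this would give (a sketch):

df = pd.DataFrame({'one': [True, True, True, False],
                   'two': [True, False, False, True],
                   'three': [True, False, False, False]})

df.duplicated().value_counts()
# False    3   <- distinct rows (first occurrences and non-duplicates)
# True     1   <- row 2, which repeats row 1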

Upvotes: 12

SeaBean

Reputation: 23237

If you find some counts missing, or get the error ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements, read on:

1. Count duplicate rows with NaN entries:

The accepted solution is great and has clearly helped many members. In a recent task, I found that it can be further fine-tuned to support complete counting of a dataframe with NaN entries. Pandas represents missing or null values as NaN. Let's see what the output is for this use case when our dataframe contains NaN entries:

  Col1  Col2 Col3 Col4
0  ABC   123  XYZ  NaN    # group #1 of 3
1  ABC   123  XYZ  NaN    # group #1 of 3
2  ABC   678  PQR  def    # group #2 of 1
3  MNO   890  EFG  abc    # group #3 of 4
4  MNO   890  EFG  abc    # group #3 of 4
5  CDE   234  567  xyz    # group #4 of 2
6  ABC   123  XYZ  NaN    # group #1 of 3
7  CDE   234  567  xyz    # group #4 of 2
8  MNO   890  EFG  abc    # group #3 of 4
9  MNO   890  EFG  abc    # group #3 of 4
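
For reference, a minimal construction of this example dataframe (a sketch, assuming np.nan for the missing entries and integers in Col2):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['ABC', 'ABC', 'ABC', 'MNO', 'MNO', 'CDE', 'ABC', 'CDE', 'MNO', 'MNO'],
    'Col2': [123, 123, 678, 890, 890, 234, 123, 234, 890, 890],
    'Col3': ['XYZ', 'XYZ', 'PQR', 'EFG', 'EFG', '567', 'XYZ', '567', 'EFG', 'EFG'],
    'Col4': [np.nan, np.nan, 'def', 'abc', 'abc', 'xyz', np.nan, 'xyz', 'abc', 'abc'],
})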

Applying the code:

df.groupby(df.columns.tolist(), as_index=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   678  PQR  def     1
1  CDE   234  567  xyz     2
2  MNO   890  EFG  abc     4

Oh, how come the count of Group #1 with 3 duplicate rows is missing?!

For some Pandas versions, you may get an error instead: ValueError: Length mismatch: Expected axis has nnnn elements, new values have mmmm elements

Solution:

Use the parameter dropna= for the .groupby() function, as follows:

df.groupby(df.columns.tolist(), as_index=False, dropna=False).size()

gives:

  Col1  Col2 Col3 Col4  size
0  ABC   123  XYZ  NaN     3          # <===  count of rows with `NaN`
1  ABC   678  PQR  def     1
2  CDE   234  567  xyz     2
3  MNO   890  EFG  abc     4

The count of duplicate rows with NaN can be successfully output with dropna=False. This parameter has been supported since Pandas version 1.1.0.


2. Alternative Solution

Another way to count duplicate rows with NaN entries is as follows:

df.value_counts(dropna=False).reset_index(name='count')

gives:

  Col1  Col2 Col3 Col4  count
0  MNO   890  EFG  abc      4
1  ABC   123  XYZ  NaN      3
2  CDE   234  567  xyz      2
3  ABC   678  PQR  def      1

Here, we use the .value_counts() function, also with the parameter dropna=False. However, this parameter has been supported only since Pandas version 1.3.0. If your version is older than that, you'll need to use the .groupby() solution above to get complete counts for rows with NaN entries.

You will see that the output is in a different order from the previous result: the counts are sorted in descending order. If you want an unsorted result, you can specify sort=False:

df.value_counts(dropna=False, sort=False).reset_index(name='count')

This gives the same result as the df.groupby(df.columns.tolist(), as_index=False, dropna=False).size() solution:

  Col1  Col2 Col3 Col4  count
0  ABC   123  XYZ  NaN      3
1  ABC   678  PQR  def      1
2  CDE   234  567  xyz      2
3  MNO   890  EFG  abc      4

Note that this .value_counts() solution supports dataframes both with and without NaN entries and can be used as a general solution.

In fact, in the underlying implementation, .value_counts() calls GroupBy.size to get the counts: counts = self.groupby(subset, dropna=dropna).grouper.size()

Hence, for this use case, .value_counts() and the .groupby() approach in the accepted answer are actually doing the same thing. We should be able to use the .value_counts() function to get the desired counts of duplicate rows equally well.

Using the .value_counts() function to count duplicate rows has the additional benefit of simpler syntax. You can simply use df.value_counts() or df.value_counts(dropna=False), depending on whether your dataframe contains NaN. Chain with .reset_index() if you want the result as a dataframe instead of a Series.

Upvotes: 6

Arash

Reputation: 1064

Specific to your question, as others have mentioned, a fast and easy way would be:

df.groupby(df.columns.tolist(), as_index=False).size()

If you want to count duplicates in particular column(s):

len(df['one'])-len(df['one'].drop_duplicates())

If you want to count duplicates over the entire dataframe:

len(df)-len(df.drop_duplicates())

Or you can simply use DataFrame.duplicated(subset=None, keep='first'):

df.duplicated(subset='one', keep='first').sum()

where

subset : column label or sequence of labels (by default, use all of the columns)

keep : {‘first’, ‘last’, False}, default ‘first’

  • first : Mark duplicates as True except for the first occurrence.
  • last : Mark duplicates as True except for the last occurrence.
  • False : Mark all duplicates as True.
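
As a quick sketch on the question's first example:

df = pd.DataFrame({'one': pd.Series([1., 1, 1]),
                   'two': pd.Series([1., 2., 1])})

df.duplicated(subset='one', keep='first').sum()  # 2: all but the first of the three 1s
df.duplicated(subset='one', keep=False).sum()    # 3: every row whose 'one' value repeats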

Upvotes: 61

Mykola Zotko

Reputation: 17911

To count rows in a DataFrame, you can use the method value_counts (available since Pandas 1.1.0):

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 10, 20, 20, 30]})

df.value_counts().reset_index(name='counts').query('counts > 1')

Output:

   A   B  counts
0  1  10       2
1  2  20       2

Upvotes: 0

ThibTrip

Reputation: 21

I ran into this problem today and wanted to include NaNs, so I replace them temporarily with "" (empty string). Please comment if you do not understand something :). This solution assumes that "" is not a relevant value for you. It should also work with numerical data (I have tested it successfully, but not extensively), since pandas will infer the data type again after replacing "" with np.nan.

import numpy as np
import pandas as pd

# create test data
df = pd.DataFrame({'test':['foo','bar',None,None,'foo'],
                  'test2':['bar',None,None,None,'bar'],
                  'test3':[None, 'foo','bar',None,None]})

# fill null values with '' to not lose them during groupby
# groupby all columns and calculate the length of the resulting groups
# rename the series obtained with groupby to "group_count"
# reset the index to get a DataFrame
# replace '' with np.nan (this reverts our first operation)
# sort DataFrame by "group_count" descending
df = (df.fillna('')
      .groupby(df.columns.tolist()).apply(len)
      .rename('group_count')
      .reset_index()
      .replace('', np.nan)
      .sort_values(by=['group_count'], ascending=False))
df
  test test2 test3  group_count
3  foo   bar   NaN            2
0  NaN   NaN   NaN            1
1  NaN   NaN   bar            1
2  bar   NaN   foo            1
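
Note: on pandas 1.1.0 and later, the temporary fill should not be needed at all, since .groupby() accepts dropna=False (a sketch under that assumption):

df = (df.groupby(df.columns.tolist(), dropna=False)   # requires pandas >= 1.1.0
        .size()
        .rename('group_count')
        .reset_index()
        .sort_values(by=['group_count'], ascending=False))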

Upvotes: 2

Sergey Zaitsev

Reputation: 585

I use:

used_features =[
    "one",
    "two",
    "three"
]

df['is_duplicated'] = df.duplicated(used_features)
df['is_duplicated'].sum()

which gives the count of duplicated rows, and then lets you analyse them via the new column. I didn't see such a solution here.
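
On the boolean example from the question's edit, this flags row 2 as a repeat of row 1 (output sketch; column order may differ by pandas version):

     one    two  three  is_duplicated
0   True   True   True          False
1   True  False  False          False
2   True  False  False           True
3  False   True  False          False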

Upvotes: 13

Denis

Reputation: 511

df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'records'})

   one  two  records
0    1    1        2
1    1    2        1

Upvotes: 51

olisteadman

Reputation: 452

None of the existing answers quite offers a simple solution that returns "the number of rows that are just duplicates and should be cut out". This is a one-size-fits-all solution that does:

# generate a table of those culprit rows which are duplicated:
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})

# sum the final col of that table, and subtract the number of culprits:
dups['count'].sum() - dups.shape[0]
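
On a dataframe without NaN rows (which groupby drops by default), this difference should equal the simpler:

df.duplicated().sum()

since the total row count minus the number of distinct rows is exactly the number of surplus duplicates.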

Upvotes: 8

Jarad

Reputation: 18983

df = pd.DataFrame({'one': pd.Series([1., 1, 1, 3]),
                   'two': pd.Series([1., 2., 1, 3]),
                   'three': pd.Series([1., 2., 1, 2])})
df['str_list'] = df.apply(lambda row: ' '.join([str(int(val)) for val in row]), axis=1)
df1 = pd.DataFrame(df['str_list'].value_counts().values,
                   index=df['str_list'].value_counts().index,
                   columns=['Count'])

Produces:

>>> df1
       Count
1 1 1      2
3 2 3      1
1 2 2      1

If the index values must be a list, you could take the above code a step further with:

df1.index = df1.index.str.split()

Produces:

           Count
[1, 1, 1]      2
[3, 2, 3]      1
[1, 2, 2]      1

Upvotes: 4

EdChum

Reputation: 394469

You can groupby on all the columns and call size; the index indicates the duplicate values:

In [28]:
df.groupby(df.columns.tolist(),as_index=False).size()

Out[28]:
one    three  two  
False  False  True     1
True   False  False    2
       True   True     1
dtype: int64
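
Note: this Series output is from an older pandas; since around version 1.1.0, as_index=False makes .size() return a DataFrame instead (a sketch of the newer output):

     one  three    two  size
0  False  False   True     1
1   True  False  False     2
2   True   True   True     1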

Upvotes: 130
