ceharep

Reputation: 439

pandas - get count of duplicate rows (matching across multiple columns)

I have a table like below - unique IDs and names. I want to return any duplicated names (based on matching First and Last).

   Id First   Last  
    1  Dave  Davis        
    2  Dave  Smith       
    3   Bob  Smith    
    4  Dave  Smith     

I've managed to return a count of duplicates across all columns if I don't have an ID column, i.e.

import pandas as pd

dict2 = {'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
         'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df2 = pd.DataFrame(dict2)

print(df2.groupby(df2.columns.tolist()).size().reset_index()
      .rename(columns={0: 'records'}))

Output:

  First   Last  records
0   Bob  Smith        1
1  Dave  Davis        1
2  Dave  Smith        2

I want to be able to return the duplicates (of first and last) when I also have an ID column, i.e.

import pandas as pd

dict1 = {'Id': pd.Series([1, 2, 3, 4]),
         'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
         'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df1 = pd.DataFrame(dict1)

print(df1.groupby(df1.columns.tolist()).size().reset_index()
      .rename(columns={0: 'records'}))

gives:

   Id First   Last  records
0   1  Dave  Davis        1
1   2  Dave  Smith        1
2   3   Bob  Smith        1
3   4  Dave  Smith        1

I want (ideally):

  First   Last  records   Ids
0  Dave  Smith        2   2, 4

Upvotes: 1

Views: 929

Answers (1)

jezrael

Reputation: 862771

First filter only the duplicated rows with DataFrame.duplicated, passing the columns to check and keep=False so that all duplicates are marked, and select them by boolean indexing. Then aggregate with GroupBy.agg, counting each group with size and joining the Ids after converting them to strings:

df2 = (df1[df1.duplicated(['First','Last'], keep=False)]
       .groupby(['First','Last'])['Id']
       .agg(records='size', Ids=lambda x: ','.join(x.astype(str)))
       .reset_index())
print(df2)
  First   Last  records  Ids
0  Dave  Smith        2  2,4
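A related variant (a sketch, not part of the answer): GroupBy.transform broadcasts each group's size back to every row, so the duplicated rows can be selected with a simple boolean mask without aggregating first.

```python
import pandas as pd

# Same data as in the question
df1 = pd.DataFrame({'Id': [1, 2, 3, 4],
                    'First': ["Dave", "Dave", "Bob", "Dave"],
                    'Last': ["Davis", "Smith", "Smith", "Smith"]})

# Per-row group size for each (First, Last) pair: [1, 2, 1, 2]
counts = df1.groupby(['First', 'Last'])['Id'].transform('size')

# Keep only rows whose name pair occurs more than once
dupes = df1[counts > 1]
print(dupes)  # keeps the rows with Id 2 and 4
```

This keeps the original rows (with their Ids) instead of collapsing them, which can be handy if you want to inspect the duplicates before summarising.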

Another idea is to aggregate all groups first and then filter with DataFrame.query:

df2 = (df1.groupby(['First','Last'])['Id']
          .agg(records='size', Ids=lambda x: ','.join(x.astype(str)))
          .reset_index()
          .query('records != 1'))
print(df2)
  First   Last  records  Ids
2  Dave  Smith        2  2,4
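For completeness, a self-contained sketch (an alternative I'm adding, not from the answer) that uses GroupBy.filter to drop the singleton groups, then builds the exact output the question asked for, including the `"2, 4"` formatting:

```python
import pandas as pd

# Same data as in the question
df1 = pd.DataFrame({'Id': [1, 2, 3, 4],
                    'First': ["Dave", "Dave", "Bob", "Dave"],
                    'Last': ["Davis", "Smith", "Smith", "Smith"]})

# GroupBy.filter keeps only the groups whose predicate is True,
# i.e. drops every (First, Last) pair that occurs exactly once
dupes = df1.groupby(['First', 'Last']).filter(lambda g: len(g) > 1)

# Named aggregation: count the group and join the Ids with ", "
out = (dupes.groupby(['First', 'Last'])['Id']
            .agg(records='size',
                 Ids=lambda x: ', '.join(x.astype(str)))
            .reset_index())
print(out)
```

GroupBy.filter is less efficient than the duplicated-based mask above on large frames (it calls the predicate once per group), but it reads very close to the problem statement.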

Upvotes: 1
