Reputation: 439
I have a table like below - unique IDs and names. I want to return any duplicated names (based on matching First and Last).
Id First Last
1 Dave Davis
2 Dave Smith
3 Bob Smith
4 Dave Smith
I've managed to return a count of duplicates across all columns if I don't have an ID column, i.e.
import pandas as pd
dict2 = {'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df2 = pd.DataFrame(dict2)
print(df2.groupby(df2.columns.tolist()).size().reset_index().\
rename(columns={0:'records'}))
Output:
First Last records
0 Bob Smith 1
1 Dave Davis 1
2 Dave Smith 2
I want to be able to return the duplicates (of first and last) when I also have an ID column, i.e.
import pandas as pd
dict1 = {'Id': pd.Series([1, 2, 3, 4]),
'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df1 = pd.DataFrame(dict1)
print(df1.groupby(df1.columns.tolist()).size().reset_index().\
rename(columns={0:'records'}))
gives:
Id First Last records
0 1 Dave Davis 1
1 2 Dave Smith 1
2 3 Bob Smith 1
3 4 Dave Smith 1
I want (ideally):
First Last records Ids
0 Dave Smith 2 2, 4
Upvotes: 1
Views: 929
Reputation: 862771
first filter only duplicated rows by DataFrame.duplicated
by columns for check and keep=False
for return all dupes, filter by boolean indexing
. Then aggregate by GroupBy.agg
counts with GroupBy.size
and join id
with converting to strings
:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1[df1.duplicated(['First','Last'], keep=False)]
.groupby(['First','Last'])['Id'].agg(tup)
.reset_index())
print (df2)
First Last records Ids
0 Dave Smith 2 2,4
Another idea is aggregate all values and then filter with DataFrame.query
:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1.groupby(['First','Last'])['Id'].agg(tup)
.reset_index()
.query('records != 1'))
print (df2)
First Last records Ids
2 Dave Smith 2 2,4
Upvotes: 1