how to display duplicate ID along with duplicate data in the python pandas

Question

I have a data frame which looks below

k={'ID':[1,2,3,4,5,6],'Name':['John Danny','Micheal K','John Danny','jerred','John Danny','joe'],'phone':['1111',
                                                                                   '2222','2233','1111','2222','6666']}
df=pd.DataFrame(data=k)
df
    ID  Name       phone
    1   John Danny  1111
    2   Micheal K   2222
    3   John Danny  2233
    4   jerred      1111
    5   John Danny  2222

I need to find the duplicated in name and phone in the data frame so used the below-given code

df[df['Name'].duplicated(keep=False)].sort_values("Name")

duplicated based on name


ID  Name       phone
1   John Danny  1111
3   John Danny  2233
5   John Danny  2222

duplicated based on phone

    ID  Name       phone
    1   John Danny  1111
    4   jerred      1111
    2   Micheal K   2222
    5   John Danny  2222

but I want the result as follows

ID  Name      phone duplicated of name ids  duplicated of phone ids Duplicate_name  Duplicate_phone
1   John Danny  1111    3,5                    4                    Yes              Yes
2   Micheal K   2222                           5                     No              Yes
3   John Danny  2233    1,5                                          Yes              No
4   jerred      1111                           1                     No              Yes
5   John Danny  2222    1,3                    2                     Yes             Yes

I was able to find duplicate_name and duplicate_phone by using below code

df['Duplicate_name'] = df['Name'].duplicated(keep=False).map({True:'Yes', False:'No'})
df['Duplicate_phone'] = df['phone'].duplicated(keep=False).map({True:'Yes', False:'No'})

The problem is not able display ID in duplicated of phone ids and duplicated of name ids as above given result table how to do it?

jezrael · Accepted Answer

Use GroupBy.transform with custom function for subtract sets:

def f(x):
    return [', '.join(set(x) - set([y])) for y in x]

Or filter with if in generator:

def f(x):
    return [', '.join(z for z in x if z != y) for y in x]

df['duplicated of name ids'] = df['ID'].astype(str).groupby(df['Name']).transform(f)
df['duplicated of phone ids'] = df['ID'].astype(str).groupby(df['phone']).transform(f)


df['Duplicate_name'] = df['Name'].duplicated(keep=False).map({True:'Yes', False:'No'})
df['Duplicate_phone'] = df['phone'].duplicated(keep=False).map({True:'Yes', False:'No'})
print (df)
   ID        Name phone duplicated of name ids duplicated of phone ids  \
0   1  John Danny  1111                   5, 3                       4   
1   2   Micheal K  2222                                              5   
2   3  John Danny  2233                   5, 1                           
3   4      jerred  1111                                              1   
4   5  John Danny  2222                   1, 3                       2   
5   6         joe  6666                                                  

  Duplicate_name Duplicate_phone  
0            Yes             Yes  
1             No             Yes  
2            Yes              No  
3             No             Yes  
4            Yes             Yes  
5             No              No

how to display duplicate ID along with duplicate data in the python pandas

but I want the result as follows

The problem is not able display ID in duplicated of phone ids and duplicated of name ids as above given result table how to do it?

Answers (1)

Related Questions