CodeGeek123
CodeGeek123

Reputation: 4501

How do i detect duplicates and then among them cross check if two columns have similar values?

So i have a dataframe like this

 No    fname        sname        landline        address
 1   Alphred      Thomas         123              A
 2   Peter        Jay            345              B
 3   Donald       Hook           123              A
 4   Jay          Donald         345              B
 5   Jay          Donald         123              A
 6   Haskell      Peter          123              B

Now i want to get all duplicates of landline and address together. So in the case above, the group (123,A) would be one set of recurring entities and (345,B) would be another set of recurring entities. I want to ignore (123,B) since this occurs only once.

Now for each of the duplicate groups I want to check if a single name occurs in both the fnmae and sname column. So in the case of (123,A)we want to to capture the lines where donald is appearing on both the fname and sname (basically they have to be two different rows and the two columns have to have similar names) So in the case above we would pick row 3 and 5. I want to perform one more operation on this after we pick this. Some sort of date at which name was entered check.

How do i achieve this? I tried using duplicated but this doesn't help so much for the second comparison?

Upvotes: 0

Views: 111

Answers (1)

jezrael
jezrael

Reputation: 863281

You can use groupby with isin for mask and then boolean indexing:

mask = df.groupby(['landline','address']).apply(lambda x: x.fname.isin(x.sname) | 
                                                          x.sname.isin(x.fname) & 
                                                            (len(x) > 1))
mask = mask.reset_index(level=['landline','address'], drop=True).sort_index()
print (mask)
0    False
1     True
2     True
3     True
4     True
5    False
dtype: bool

df1 = df[mask]
print (df1)
   No   fname   sname  landline address
1   2   Peter     Jay       345       B
2   3  Donald    Hook       123       A
3   4     Jay  Donald       345       B
4   5     Jay  Donald       123       A

EDIT: I think you can use custom function with filtering:

def f(x):
    print (x)
    mask = x.fname.isin(x.sname) | x.sname.isin(x.fname) & (len(x) > 1)
    x1 = x[mask]
    return x1


df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
print (df2)
   No   fname   sname  landline address
0   3  Donald    Hook       123       A
1   5     Jay  Donald       123       A
2   2   Peter     Jay       345       B
3   4     Jay  Donald       345       B

Upvotes: 2

Related Questions