Reputation: 4501
So i have a dataframe like this
No fname sname landline address
1 Alphred Thomas 123 A
2 Peter Jay 345 B
3 Donald Hook 123 A
4 Jay Donald 345 B
5 Jay Donald 123 A
6 Haskell Peter 123 B
Now i want to get all duplicates of landline and address together. So in the case above, the group (123,A) would be one set of recurring entities and (345,B) would be another set of recurring entities. I want to ignore (123,B) since this occurs only once.
Now for each of the duplicate groups I want to check if a single name occurs in both the fnmae and sname column. So in the case of (123,A)we want to to capture the lines where donald is appearing on both the fname and sname (basically they have to be two different rows and the two columns have to have similar names) So in the case above we would pick row 3 and 5. I want to perform one more operation on this after we pick this. Some sort of date at which name was entered check.
How do i achieve this? I tried using duplicated but this doesn't help so much for the second comparison?
Upvotes: 0
Views: 111
Reputation: 863281
You can use groupby
with isin
for mask and then boolean indexing
:
mask = df.groupby(['landline','address']).apply(lambda x: x.fname.isin(x.sname) |
x.sname.isin(x.fname) &
(len(x) > 1))
mask = mask.reset_index(level=['landline','address'], drop=True).sort_index()
print (mask)
0 False
1 True
2 True
3 True
4 True
5 False
dtype: bool
df1 = df[mask]
print (df1)
No fname sname landline address
1 2 Peter Jay 345 B
2 3 Donald Hook 123 A
3 4 Jay Donald 345 B
4 5 Jay Donald 123 A
EDIT: I think you can use custom function with filtering
:
def f(x):
print (x)
mask = x.fname.isin(x.sname) | x.sname.isin(x.fname) & (len(x) > 1)
x1 = x[mask]
return x1
df2 = df.groupby(['landline','address']).apply(f).reset_index(drop=True)
print (df2)
No fname sname landline address
0 3 Donald Hook 123 A
1 5 Jay Donald 123 A
2 2 Peter Jay 345 B
3 4 Jay Donald 345 B
Upvotes: 2