Reputation: 2847
I want to find the non-duplicate rows within groups and return all of them, but only when the key appears multiple times. For example,
Var Code Description
1   A    ABC
2   B    EF
1   A    ABCD
2   B    EF
1   B    ABC
3   C    DD
I expect to get a result like this:
Var Code Description
1   A    ABC
1   A    ABCD
The reason I want this table is to detect which (Var, Code) pairs have multiple descriptions, because in my case a pair with multiple descriptions means those descriptions should be modified to be unique.
The table returns rows that have the same variable name and the same code name but different descriptions, because in my data a code of a variable should have exactly one description.
3 C DD is not returned because there is only one record for the (3, C) pair, so DD is its only description.
2 B EF is not returned because, although two rows contain the (2, B) pair, they have the same description, which is fine.
1 B ABC is not returned because, although Var 1 appears three times, the (1, B) pair appears only once and has only one description, ABC.
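For anyone who wants to reproduce this, the sample data above can be built as a pandas DataFrame like so (assuming pandas, which the answers below use):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Var': [1, 2, 1, 2, 1, 3],
    'Code': ['A', 'B', 'A', 'B', 'B', 'C'],
    'Description': ['ABC', 'EF', 'ABCD', 'EF', 'ABC', 'DD'],
})
```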
Any solutions?
Thank you!
Upvotes: 1
Views: 382
Reputation: 942
msk1 = df.duplicated(subset=['Code', 'Var'], keep=False)
msk2 = df.duplicated(subset=['Code', 'Var', 'Description'], keep=False)
msk3 = msk1 & ~msk2
df[msk3]
# msk1 filters to all rows where both Var and Code are duplicated
print(df[msk1])
   Var Code Description
0    1    A         ABC
1    2    B          EF
2    1    A        ABCD
3    2    B          EF
# msk2 further filters to where Var Code and Description are duplicated
print(df[msk2])
   Var Code Description
1    2    B          EF
3    2    B          EF
# so then finally msk3 is msk1 without msk2
# give me all rows where Var and Code are duplicated
# except for the ones where the Description is also duplicated
# - the desired result
print(df[msk3])
   Var Code Description
0    1    A         ABC
2    1    A        ABCD
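The two-mask idea generalizes to any key/value columns. Here is a self-contained sketch that wraps it in a helper function (the function name and signature are made up for illustration; the sample frame is rebuilt so the snippet runs on its own):

```python
import pandas as pd

def conflicting_rows(df, keys, value):
    # Rows whose key combination appears more than once...
    dup_keys = df.duplicated(subset=keys, keep=False)
    # ...minus the rows where the full (keys + value) tuple is itself
    # duplicated, i.e. where the value agrees within the key group
    dup_full = df.duplicated(subset=keys + [value], keep=False)
    return df[dup_keys & ~dup_full]

df = pd.DataFrame({
    'Var': [1, 2, 1, 2, 1, 3],
    'Code': ['A', 'B', 'A', 'B', 'B', 'C'],
    'Description': ['ABC', 'EF', 'ABCD', 'EF', 'ABC', 'DD'],
})
result = conflicting_rows(df, ['Var', 'Code'], 'Description')
```

One caveat with this mask-based approach: in a group whose descriptions are, say, ABC, ABC, ABCD, the two identical ABC rows are themselves duplicates and get filtered out, so only ABCD is returned; the groupby answer below keeps all three rows.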
Upvotes: 0
Reputation: 38415
You can use groupby and filter the rows using nunique:
new_df = df.groupby(['Var', 'Code']).filter(lambda x: x['Description'].nunique() > 1)
   Var Code Description
0    1    A         ABC
2    1    A        ABCD
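An equivalent formulation, usually faster on large frames than filter with a Python lambda, uses transform to build a boolean mask; a sketch assuming the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    'Var': [1, 2, 1, 2, 1, 3],
    'Code': ['A', 'B', 'A', 'B', 'B', 'C'],
    'Description': ['ABC', 'EF', 'ABCD', 'EF', 'ABC', 'DD'],
})

# transform broadcasts each group's nunique count back to every row,
# so the comparison yields a boolean mask aligned with df
mask = df.groupby(['Var', 'Code'])['Description'].transform('nunique') > 1
new_df = df[mask]
```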
Upvotes: 3