Eleanor
Eleanor

Reputation: 2847

Check non-duplicates by groups Python

I want to check non-duplicates in groups and return all these non-duplicates but only when these keys shows multiple times. For example,

Var  Code   Description
1    A      ABC
2    B      EF
1    A      ABCD
2    B      EF
1    B      ABC
3    C      DD

I expect to get some result like this

Var  Code Description
1    A    ABC
1    A    ABCD

The reason I want this table is to detect which (var,code) pair have multiple description, because in my case, if it has multiple description, it means the description for these pairs should be modified to be unique.

This table returns rows that have same variable name and same code name but different description. Because for my data, a code of a variable should have only one description instead of multiple descriptions.

3 C DD is not returned is because there is only one record about (3,C) pair and DD is the only description for this pair.

2 B EF is not returned is because though two rows contain (2,B) pair, they have the same description which is normal.

1 B ABC is not returned is because though var 1 shows three times but (1,B) pair shows only one time and has only one description ABC.

Any solutions?

Thank you!

Upvotes: 1

Views: 382

Answers (2)

robertwest
robertwest

Reputation: 942

msk1 = df.duplicated(subset=['Code', 'Var'], keep=False)
msk2 = df.duplicated(subset=['Code', 'Var', 'Description'], keep=False)
msk3 = msk1 & ~msk2
df[msk3]

# msk1 filters to all rows where both Var and Code are duplicated
print(df[msk1])
Var  Code Description
1    A    ABC
1    A    ABCD
2    B    EF
2    B    EF

# msk2 further filters to where Var Code and Description are duplicated 
print(df[msk2])
Var Code Description 
2   B    EF
2   B    EF

# so then finally msk3 is msk1 without msk2 
# give me all rows where Var and Code are duplicated 
# except for the ones where the Description is also duplicated
# - the desired result 
print(df[msk3])
Var  Code Description
1    A    ABC
1    A    ABCD

Upvotes: 0

Vaishali
Vaishali

Reputation: 38415

You can use groupby and filter the rows using nunique

new_df = df.groupby(['Var', 'Code']).filter(lambda x: x['Description'].nunique() > 1) 

    Var Code    Description
0   1   A       ABC
2   1   A       ABCD

Upvotes: 3

Related Questions