Reputation: 33
My dataframe has a category and subcategory column, and then a column with strings that sometimes are repeated.
My question is for each category (CAT A) which strings are repeated across subcategories (CAT B)
CAT A | CAT B | Strings |
---|---|---|
A1 | B1 | String1 |
A1 | B1 | String2 |
A1 | B1 | String3 |
A1 | B2 | String4 |
A1 | B2 | String5 |
A1 | B2 | String1 |
A2 | B1 | String1 |
A2 | B1 | String2 |
A2 | B1 | String3 |
A2 | B2 | String4 |
A2 | B2 | String5 |
A2 | B2 | String6 |
The output I am looking for
A1
Repeated strings in B1 and B2
"String1"
---
A2
Repeated strings in B1 and B2
None
I'm confused on how to group this and compare the groups.
Thanks
Upvotes: 0
Views: 625
Reputation: 24314
You can try via duplicated()
with keep=False
m=df.duplicated(subset=['CAT A','Strings'],keep=False)
#OR via groupby()+transform()
#m=df.groupby('CAT A')['Strings'].transform(lambda x:x.duplicated(keep=False))
Finally:
out=df.loc[m]
output of out
:
CAT A CAT B Strings
0 A1 B1 String1
5 A1 B2 String1
If needed a seperate column:
df.loc[m,'duplicated']=df.loc[m,'Strings']
output of out
:
CAT A CAT B Strings duplicated
0 A1 B1 String1 String1
1 A1 B1 String2 NaN
2 A1 B1 String3 NaN
3 A1 B2 String4 NaN
4 A1 B2 String5 NaN
5 A1 B2 String1 String1
6 A2 B1 String1 NaN
7 A2 B1 String2 NaN
8 A2 B1 String3 NaN
9 A2 B2 String4 NaN
10 A2 B2 String5 NaN
11 A2 B2 String6 NaN
Upvotes: 0