Damon Obako
Damon Obako

Reputation: 33

PANDAS : How can I output duplicate values within a group?

My dataframe has a category and subcategory column, and then a column with strings that sometimes are repeated.

My question is for each category (CAT A) which strings are repeated across subcategories (CAT B)

CAT A CAT B Strings
A1 B1 String1
A1 B1 String2
A1 B1 String3
A1 B2 String4
A1 B2 String5
A1 B2 String1
A2 B1 String1
A2 B1 String2
A2 B1 String3
A2 B2 String4
A2 B2 String5
A2 B2 String6

The output I am looking for

A1
Repeated strings in  B1 and B2
"String1"

---

A2
Repeated strings in  B1 and B2
None

I'm confused on how to group this and compare the groups.

Thanks

Upvotes: 0

Views: 625

Answers (1)

Anurag Dabas
Anurag Dabas

Reputation: 24314

You can try via duplicated() with keep=False

m=df.duplicated(subset=['CAT A','Strings'],keep=False)
#OR via groupby()+transform()
#m=df.groupby('CAT A')['Strings'].transform(lambda x:x.duplicated(keep=False))

Finally:

out=df.loc[m]

output of out:

   CAT A    CAT B   Strings
0   A1      B1      String1
5   A1      B2      String1

If needed a seperate column:

df.loc[m,'duplicated']=df.loc[m,'Strings']

output of out :

   CAT A CAT B  Strings duplicated
0     A1    B1  String1    String1
1     A1    B1  String2        NaN
2     A1    B1  String3        NaN
3     A1    B2  String4        NaN
4     A1    B2  String5        NaN
5     A1    B2  String1    String1
6     A2    B1  String1        NaN
7     A2    B1  String2        NaN
8     A2    B1  String3        NaN
9     A2    B2  String4        NaN
10    A2    B2  String5        NaN
11    A2    B2  String6        NaN

Upvotes: 0

Related Questions