Reputation: 33
I'm working on a dataset that looks like this:
col1
person1 gene1
person1 gene1
person1 gene2
person1 gene3
person1 gene4
person2 gene1
person2 gene2
person2 gene3
person2 gene4
person3 gene1
person3 gene1
person3 gene1
person3 gene2
person3 gene3
person3 gene3
person3 gene4
For each person, I want to count the number of genes that appear more than once.
For example, in the case I presented above, person1 has gene1 duplicated, person2 has no genes duplicated, and person3 has gene1 and gene3 duplicated. Thus, I would want my code to output 3.
I know that pandas has a duplicated method: DataFrame.duplicated(subset=None, keep='first')
However, when I try to use it on my dataframe, I keep getting told that I need to apply it?
Thanks
Edit: here is a second example for clarification:
person1 gene1
person1 gene1
person1 gene2
person1 gene2
person2 gene1
person2 gene1
person3 gene1
person3 gene1
person3 gene2
person3 gene2
person3 gene2
Upvotes: 1
Views: 31
Reputation: 323266
You can do this with groupby and size: group on all columns, then count the groups that have more than one row.
df.groupby([*df.columns]).size().gt(1).sum()
Out[37]: 3
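In case it helps, here is a self-contained sketch of the same idea on the first sample from the question. The column names person and gene are my assumption (the question only shows col1), so adjust them to your real columns. It also shows one way to get the same number with duplicated, plus a per-person breakdown:

```python
import pandas as pd

# Sample data from the question; the column names "person" and "gene" are assumed.
rows = [
    ("person1", "gene1"), ("person1", "gene1"), ("person1", "gene2"),
    ("person1", "gene3"), ("person1", "gene4"),
    ("person2", "gene1"), ("person2", "gene2"), ("person2", "gene3"),
    ("person2", "gene4"),
    ("person3", "gene1"), ("person3", "gene1"), ("person3", "gene1"),
    ("person3", "gene2"), ("person3", "gene3"), ("person3", "gene3"),
    ("person3", "gene4"),
]
df = pd.DataFrame(rows, columns=["person", "gene"])

# Number of rows in each (person, gene) group; a group with more than
# one row is a gene that is duplicated for that person.
pair_counts = df.groupby(["person", "gene"]).size()
total_duplicated = int(pair_counts.gt(1).sum())

# Same count via duplicated(): keep=False flags every row of a repeated
# pair, and drop_duplicates() then leaves one row per repeated pair.
via_duplicated = len(df[df.duplicated(keep=False)].drop_duplicates())

# Per-person breakdown: how many genes are duplicated for each person.
per_person = pair_counts.gt(1).groupby(level="person").sum()

print(total_duplicated)  # 3
print(via_duplicated)    # 3
print(per_person)
```

Note that a gene appearing three times (person3 gene1) still counts once, since only distinct person/gene pairs are counted.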
Upvotes: 1