Mike
Mike

Reputation: 2557

Pandas: count number of duplicate rows using groupby

I have a dataframe with duplicate rows

>>> d = pd.DataFrame({'n': ['a', 'a', 'a'], 'v': [1,2,1]})
>>> d
   n  v
0  a  1
1  a  2
2  a  1

I would like to understand how to use .groupby() method specifically so that I can add a new column to the dataframe which shows count of rows which are identical to the current one.

>>> dd = d.groupby(by=['n','v'], as_index=False)  # Use all columns to find groups of identical rows
>>> for k,v in dd:
...      print(k, "\n", v, "\n")  # Check what we found
... 
('a', 1) 
    n  v
0  a  1
2  a  1 

('a', 2) 
    n  v
1  a  2 

When I'm trying to do dd.count() on resulting DataFrameGroupBy object I get IndexError: list index out of range. This seems to happen because all columns are used in grouping operation and there's no other column to use for counting. Similarly dd.agg({'n', 'count'}) fails with ValueError: no results.

I could use .apply() to achieve something that looks like result.

>>> dd.apply(lambda x: x.assign(freq=len(x)))
     n  v  freq
0 0  a  1     2
  2  a  1     2
1 1  a  2     1

However this has two issues: 1) something happens to the index so that it is hard to map this back to the original index, 2) this does not seem idiomatic Pandas and manuals discourage using .apply() as it could be slow.

Is there more idiomatic way to count duplicate rows when using .groupby()?

Upvotes: 1

Views: 2411

Answers (1)

jezrael
jezrael

Reputation: 862851

One solution is use GroupBy.size for aggregate output with counter:

d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
   n  v  c
0  a  1  2
1  a  2  1

Your solution working if specify some column name after groupby, because no another columns n, v in input DataFrame:

d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')

print (d)
   n  v  c
0  a  1  2
1  a  2  1

What is also necessary if need new column with GroupBy.transform - new column is filled by aggregate values:

d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
   n  v  c
0  a  1  2
1  a  2  1
2  a  1  2

Upvotes: 4

Related Questions