Reputation: 2557
I have a dataframe with duplicate rows
>>> d = pd.DataFrame({'n': ['a', 'a', 'a'], 'v': [1,2,1]})
>>> d
n v
0 a 1
1 a 2
2 a 1
I would like to understand how to use .groupby()
method specifically so that I can add a new column to the dataframe which shows count of rows which are identical to the current one.
>>> dd = d.groupby(by=['n','v'], as_index=False) # Use all columns to find groups of identical rows
>>> for k,v in dd:
... print(k, "\n", v, "\n") # Check what we found
...
('a', 1)
n v
0 a 1
2 a 1
('a', 2)
n v
1 a 2
When I'm trying to do dd.count()
on resulting DataFrameGroupBy
object I get IndexError: list index out of range
. This seems to happen because all columns are used in grouping operation and there's no other column to use for counting. Similarly dd.agg({'n', 'count'})
fails with ValueError: no results
.
I could use .apply()
to achieve something that looks like result.
>>> dd.apply(lambda x: x.assign(freq=len(x)))
n v freq
0 0 a 1 2
2 a 1 2
1 1 a 2 1
However this has two issues: 1) something happens to the index so that it is hard to map this back to the original index, 2) this does not seem idiomatic Pandas and manuals discourage using .apply()
as it could be slow.
Is there more idiomatic way to count duplicate rows when using .groupby()
?
Upvotes: 1
Views: 2411
Reputation: 862851
One solution is use GroupBy.size
for aggregate output with counter:
d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
Your solution working if specify some column name after groupby
, because no another columns n
, v
in input DataFrame
:
d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
What is also necessary if need new column with GroupBy.transform
- new column is filled by aggregate values:
d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
n v c
0 a 1 2
1 a 2 1
2 a 1 2
Upvotes: 4