Reputation: 5488
I am just starting to look at Pandas and trying to work out how to count how many times a particular row occurs. I need to add a new column showing how many times each duplicate row is seen, and show only the unique rows.
So this:
team    runs
team 1     5
team 1     5
team 1     5
team 2     5
Would become this:
team    runs  occurrences
team 1     5            3
team 2     5            1
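For reproducibility, the input above could be built like this (a minimal sketch; the column names and values are taken from the tables in the question):

import pandas as pd

df = pd.DataFrame({'team': ['team 1', 'team 1', 'team 1', 'team 2'],
                   'runs': [5, 5, 5, 5]})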
Upvotes: 2
Views: 103
Reputation: 83
You can also try this:
In [39]: hashed = df.apply(lambda x: hash(str(x.values + 1)), axis=1)
In [40]: hashed
Out[40]:
0    4112993419872972624
1    4112993419872972624
2    4112993419872972624
3    7113020419917972579
4    6113011419891972603
5    6113011419891972603
dtype: int64
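From these hashes, one possible next step (a sketch, not part of this answer, and assuming the df defined in the answer below) is to map the hash value counts back onto every row:

# occurrences of each row, aligned with df's index
counts = hashed.map(hashed.value_counts())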
Upvotes: 0
Reputation: 128958
This is pretty easy once you figure out what an identical row means. I simply use the hash of the stringified values. If you have an alternate definition, that would work as well.
In [37]: df = DataFrame({'A' : [1,1,1,2,3,3], 'B' : [2,2,2,2,3,3]})
In [38]: df
Out[38]:
   A  B
0  1  2
1  1  2
2  1  2
3  2  2
4  3  3
5  3  3
Compute a hash for each row; identical rows yield identical hashes:
In [39]: hashed = df.apply(lambda x: hash(str(x.values)), axis=1)
In [40]: hashed
Out[40]:
0    4112993419872972622
1    4112993419872972622
2    4112993419872972622
3    7113020419917972579
4    6113011419891972603
5    6113011419891972603
dtype: int64
Map the value counts back to the original indexes. You can pass take_last=False to .drop_duplicates() if you want the first unique row rather than the last (in newer pandas this keyword has been replaced by keep='first').
In [41]: hashed.drop_duplicates().map(hashed.value_counts())
Out[41]:
0    3
3    1
4    2
dtype: int64
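To produce the occurrences column the question asks for, one possible continuation (a sketch, not part of the original answer) is to attach these counts to the de-duplicated rows:

counts = hashed.drop_duplicates().map(hashed.value_counts())   # one count per distinct row
result = df.loc[counts.index].copy()                           # keep the first row of each duplicate group
result['occurrences'] = counts                                 # index-aligned assignment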
Upvotes: 3