nipy
nipy

Reputation: 5488

How many times is a particular row present?

I am just starting to look at Pandas and trying to work out how to calculate how many times a particular row is seen. I need to add a new column showing how many times a duplicate row is seen and show only the unique rows.

So this:

team    runs  
team 1   5  
team 1   5  
team 1   5  
team 2   5  

Would become this:

team    runs occurrences   
team 1   5      3  
team 2   5      1  

Upvotes: 2

Views: 103

Answers (2)

You can try this also

In [39]: hashed = df.apply(lambda x: hash(str(x.values+1)), axis=1)

In [40]: hashed
Out[40]: 
0    4112993419872972624
1    4112993419872972626
2    4112993419872972628
3    7113020419917972579
4    6113011419891972603
5    6113011419891972603
dtype: int32

Upvotes: 0

Jeff
Jeff

Reputation: 128958

This is pretty easy once you figure out what an identical row means. I simply use the hash of the stringifed values. If you have an alternate definition then that would work as well.

In [37]: df = DataFrame({'A' : [1,1,1,2,3,3], 'B' : [2,2,2,2,3,3]})

In [38]: df
Out[38]: 
   A  B
0  1  2
1  1  2
2  1  2
3  2  2
4  3  3
5  3  3

Compute a hash for each row. Identical 'rows' yield identical hashes

In [39]: hashed = df.apply(lambda x: hash(str(x.values)), axis=1)

In [40]: hashed
Out[40]: 
0    4112993419872972622
1    4112993419872972622
2    4112993419872972622
3    7113020419917972579
4    6113011419891972603
5    6113011419891972603
dtype: int64

Map the value counts back to the original indexes. You can pass take_last=False to .drop_duplicates() if you want the first unique row (rather than the last)

In [41]: hashed.drop_duplicates().map(hashed.value_counts())
Out[41]: 
0    3
3    1
4    2
dtype: int64

Upvotes: 3

Related Questions