Reputation: 8008
forgive me if this is a basic question but i am new to pandas. I have a dataframe with with a column A and i would like to get the top n rows based on the count in Column A. For instance the raw data looks like
A B C
x 12 ere
x 34 bfhg
z 6 bgn
z 8 rty
y 567 hmmu,,u
x 545 fghfgj
x 44 zxcbv
Note that this is just a small sample of the data that i am actually working with.
So if we look at Column A, value x appears 4 times,y appears 2 times and z appears 1 time. How can i get the top n values for Column A based on this count?
print df.groupby(['A']).sum()
this gives me
A B
x 6792117
but when i do
print len(df.groupby(['A']).get_group('x'))
i get
21
furthermore
len(df.index)
gives me
23657
so how can the count of 'A' == 'x'
be 6792117
as seen in the result of group by? what am i missing?
Update
consider
print df.groupby(['A']).describe()
gives me
Tags DocID
x count 21.000000
mean 323434.142857
std 35677.410292
min 266631.000000
25% 292054.000000
50% 325575.000000
75% 347450.000000
max 380286.000000
which makes sense. i just want to get the row which has the max count as per column A.
Update2
i did
print df.groupby(['A'],as_index=False).count()
i get
A B C
0 x 21 21
1 y 11 11
2 z 8 8
so basically, for Column A, tag x has 21 entries in Column B and 21 in Column C. ColumnsB and C are unique in my case. which is good. now how do i get the top n rows with respect to column C?
Update3
So i tried
import heapq
print heapq.nlargest(3,df.groupby(['A'],as_index=False).count()['C'])
and i get
[151, 85, 72]
so i know that for Column A, i have the above counts as the top 3 counts. But i still dont know which value of Column A do these counts refer to? For example which value in Column A has a count of 151? Is there any way to link this information?
Upvotes: 3
Views: 15711
Reputation: 862481
IIUC you can use function nlargest
.
I try your sample data and get top 2 rows by column C
:
print df
A B C
0 x 12 ere
1 x 34 bfhg
2 z 6 bgn
3 z 8 rty
4 y 567 hmmu,,u
5 x 545 fghfgj
6 x 44 zxcbv
dcf = df.groupby(['A'],as_index=False).count()
print dcf
A B C
0 x 4 4
1 y 1 1
2 z 2 2
#get 2 largest rows by column C
print dcf.nlargest(2,'C')
A B C
0 x 4 4
2 z 2 2
Upvotes: 5
Reputation: 8008
one approach that i tried
import heapq
dcf = df.groupby(['A'],as_index=False).count()
print dcf.loc[dcf['C'].isin(heapq.nlargest(5,dcf['C']))].sort(['C'],ascending=False)
gives me
A B C
1664 g 151 151
1887 k 85 85
1533 q 72 72
53 y 68 68
1793 t 62 62
verified by
print len(df.loc[df["A"]=="g"])
gives me
151
so i get the desired results as i can see the top 5 values based on the count from Column A. but surely there must be a better way of doing this?
Upvotes: 0