joao1111
joao1111

Reputation: 3

Pandas - How to sum the count of a groupby() to sort by that sum

After some data manipulation and using a pivot table I was able to arrive at this desired result:

domain  code    count
a.com       200     10
            502     5
            404     8
c.com       200     31
            502     9
            503     15
b.com       200     5
            404     2
            503     1
d.com       200     47
            404     22

My objective now is to sum the count for each domain and sort the domains in descending order of that count like this:

domain  code    count
d.com       200     47
            404     22
c.com       200     31
            502     9
            503     15
a.com       200     10
            502     5
            404     8
b.com       200     5
            404     2
            503     1

If I do d.groupby('domain').sum() I just get the domains and the sum of count, but I don't want to present the sum, just need it for the sorting.

Additionally I'd like to add a percentage for each code relative to each domain, like this:

domain  code    count       %
d.com       200     47      68,115942029
            404     22      31,884057971
c.com       200     31      56,363636363
            502     9       16,363636363
            503     15      27,272727272
a.com       200     10      etc...
            502     5
            404     8
b.com       200     5
            404     2
            503     1

Upvotes: 0

Views: 1202

Answers (2)

rhug123
rhug123

Reputation: 8768

I believe you can use sort values for this.

df = df.sort_values(by=['domain','count'],ascending = (0,0))

Upvotes: 0

akuiper
akuiper

Reputation: 214957

You need groupby.transform and also argsort if you don't want to attach an extra column to the data frame:

# calculate sum by group            
domain_sum = df['count'].groupby(df.domain).transform('sum')

# sort values in descending order
df = df.loc[(-domain_sum).argsort()]

# calculate percentage
df['%'] = df['count'] / domain_sum

print(df.reset_index(drop=True))

#   domain  code  count         %
#0   d.com   200     47  0.681159
#1   d.com   404     22  0.318841
#2   c.com   200     31  0.563636
#3   c.com   502      9  0.163636
#4   c.com   503     15  0.272727
#5   a.com   200     10  0.434783
#6   a.com   502      5  0.217391
#7   a.com   404      8  0.347826
#8   b.com   200      5  0.625000
#9   b.com   404      2  0.250000
#10  b.com   503      1  0.125000

Playground

Upvotes: 3

Related Questions