Don of the Dead
Don of the Dead

Reputation: 27

Pandas: How to sum values in a column for duplicate rows

I'm loosing my mind here.

Ok. I have a larger series but this subset example will do.

As you can see, each category of fruit has 3 types and a 4th that is a total of the three types at the top of each category. You will also note that there are 2 unique codes, 20 and 30, and 30 represents the "total" row and 20 represents an actual type of fruit, so to speak.

I need code that tells me which two fruits had the most produced in 1994 based on the the total of the largest two values of each fruit, excluding the code 30 rows.

Example: for apples, the code would pick the two largest values, excluding the code 30 lines, P.Lady and Mac, and add those. Then it would do the same thing for each fruit. Finally it would give me the two largest values of those sums.

CODE    FRUIT       TYPE        PRD1994 BHT1994 BHT1995 BHT1996         
30      apples      apples      248487  263500  292200  287700
20      apples      P.Lady      123456  130000  132000  131000
20      apples      Mac         123789  132000  158400  155000
20      apples      Fiji        1242    1500    1800    1700
30      pears       pears       288618  310000  372000  362000
20      pears       Bartlett    28586   30000   36000   34000
20      pears       Brown       157478  160000  192000  189000
20      pears       Green       102554  120000  144000  139000
30      oranges     oranges     270145  272000  277000  168000
20      oranges     Navel       121578  118000  120000  11000
20      oranges     Large       15978   14000   15000   12000
20      oranges     Small       132589  140000  142000  145000
30      pineapples  pineapples  662943  697000  755200  802000
20      pineapples  Tall        151564  175000  180000  210000
20      pineapples  Short       21634   22000   25200   32000
20      pineapples  Mini        489745  500000  550000  560000
30      grapes      grapes      2029943 2095001 2412000 13114600
20      grapes      Green       1012356 1100000 1300000 12111100
20      grapes      Purple      1008586 986000  1100000 989500
20      grapes      Sour        9001    9001    12000   14000
30      peaches     peaches     704353  507856  903000  712000
20      peaches     Large       45789   52000   55000   62000
20      peaches     Small       658563  455856  848000  650000
20      peaches     Coffee      1       0       0       0

I have tried many, many things. This is what I am hoping to see:

In: ***NO IDEA***
Out: grapes, peaches

Below are some things I have tried and have failed. I moved the code around hoping for some mixture of effects to work and I can't even get to the same point. The closest I got was it showing me a series of data containing the two largest values of each FRUIT based on the PRD1994 column. I think if I could get back there maybe a ['PRD1994'].sum() might be able to finish it off.

def sort_the_fruit():
    #return fdf[fdf.CODE == 20] \
    #.groupby('FRUIT') \
    #.nlargest(2,'PRD1994') \
    #.agg({'PRD1994':'max'},ascending=False)
    #.groupby('FRUIT','TYPE')
    #return fdf.groupby('FRUIT').fdf[5:12].agg(['min', 'max'])
    #return fdf.columns[5:12]
    #return fdf.groupby(['FRUIT', 'TYPE'])
    #return fdf.groupby(['FRUIT','TYPE']).agg({'PRD1994':'max'})
    #return fdf.groupby([('CODE' != 30),'FRUIT'])['PRD1994'].sum()
    #.groupby(['CODE','FRUIT'])['PRD1994'].sum()

sort_the_fruit()

I used the 1st return of my code to get rid of the 30's. Then when I try to follow it by a .groupby() I get one of two responses:

pandas.core.groupby.DataFrameGroupBy object at 0x7fb888082be0

or

File "<ipython-input-46-266164a56eb1>", line 5
.groupby(['STNAME','CTYNAME'])
^
SyntaxError: invalid syntax

If I have line 2 uncommented, it will give me all the lines minus the 30's. Then if I uncomment line 3, I get the first response or if I uncomment line 6 I get the 2nd response.

Python recommended using .apply but I don't understand how to use .apply and I barely understand .agg to the effect of I might as well not know it exists. I tried using duplicates but it only cares about the 1st, last or all and I need it to care about 2.

I was trying to avoid iteration because of the large data set so I have asked Google so many questions and learned a lot along the way but I still have no idea how to do this. I need help, real bad, and I need you to explain each piece of your code a little so I can follow along.

Thank you for your time and patience, Don

Upvotes: 2

Views: 5332

Answers (1)

rafaelc
rafaelc

Reputation: 59274

Not sure if that is what you want. But you can use nlargest() and groupby

df[df.CODE != 30].groupby('FRUIT').PRD1994.nlargest(2) # same as .groupby.apply(lambda g: g.nlargest(2))

which yields

FRUIT         
apples      2      123789
            1      123456
grapes      17    1012356
            18    1008586
oranges     11     132589
            9      121578
peaches     22     658563
            21      45789
pears       6      157478
            7      102554
pineapples  15     489745
            13     151564
Name: PRD1994, dtype: int64

Now, if you want the sum of these pair of values, just use .sum() in the groupby. You would get:

df[df.CODE != 30].groupby('FRUIT').PRD1994.apply(lambda g: g.nlargest(2).sum())

FRUIT
apples         247245
grapes        2020942
oranges        254167
peaches        704352
pears          260032
pineapples     641309
Name: PRD1994, dtype: int64

If you want to get the nlargest of this series, just use it again :)

df[df.CODE != 30].groupby('FRUIT').PRD1994.apply(lambda g: g.nlargest(2).sum()).nlargest(2)

FRUIT
grapes     2020942
peaches     704352
Name: PRD1994, dtype: int64

Of course, if you want just grapes and peaches as output, just get the index of this final pd.Series and voila ;)

Upvotes: 3

Related Questions