Reputation: 27
I'm loosing my mind here.
Ok. I have a larger series but this subset example will do.
As you can see, each category of fruit has 3 types and a 4th that is a total of the three types at the top of each category. You will also note that there are 2 unique codes, 20 and 30, and 30 represents the "total" row and 20 represents an actual type of fruit, so to speak.
I need code that tells me which two fruits had the most produced in 1994 based on the the total of the largest two values of each fruit, excluding the code 30 rows.
Example: for apples, the code would pick the two largest values, excluding the code 30 lines, P.Lady and Mac, and add those. Then it would do the same thing for each fruit. Finally it would give me the two largest values of those sums.
CODE FRUIT TYPE PRD1994 BHT1994 BHT1995 BHT1996
30 apples apples 248487 263500 292200 287700
20 apples P.Lady 123456 130000 132000 131000
20 apples Mac 123789 132000 158400 155000
20 apples Fiji 1242 1500 1800 1700
30 pears pears 288618 310000 372000 362000
20 pears Bartlett 28586 30000 36000 34000
20 pears Brown 157478 160000 192000 189000
20 pears Green 102554 120000 144000 139000
30 oranges oranges 270145 272000 277000 168000
20 oranges Navel 121578 118000 120000 11000
20 oranges Large 15978 14000 15000 12000
20 oranges Small 132589 140000 142000 145000
30 pineapples pineapples 662943 697000 755200 802000
20 pineapples Tall 151564 175000 180000 210000
20 pineapples Short 21634 22000 25200 32000
20 pineapples Mini 489745 500000 550000 560000
30 grapes grapes 2029943 2095001 2412000 13114600
20 grapes Green 1012356 1100000 1300000 12111100
20 grapes Purple 1008586 986000 1100000 989500
20 grapes Sour 9001 9001 12000 14000
30 peaches peaches 704353 507856 903000 712000
20 peaches Large 45789 52000 55000 62000
20 peaches Small 658563 455856 848000 650000
20 peaches Coffee 1 0 0 0
I have tried many, many things. This is what I am hoping to see:
In: ***NO IDEA***
Out: grapes, peaches
Below are some things I have tried and have failed. I moved the code around hoping for some mixture of effects to work and I can't even get to the same point. The closest I got was it showing me a series of data containing the two largest values of each FRUIT based on the PRD1994 column. I think if I could get back there maybe a ['PRD1994'].sum() might be able to finish it off.
def sort_the_fruit():
#return fdf[fdf.CODE == 20] \
#.groupby('FRUIT') \
#.nlargest(2,'PRD1994') \
#.agg({'PRD1994':'max'},ascending=False)
#.groupby('FRUIT','TYPE')
#return fdf.groupby('FRUIT').fdf[5:12].agg(['min', 'max'])
#return fdf.columns[5:12]
#return fdf.groupby(['FRUIT', 'TYPE'])
#return fdf.groupby(['FRUIT','TYPE']).agg({'PRD1994':'max'})
#return fdf.groupby([('CODE' != 30),'FRUIT'])['PRD1994'].sum()
#.groupby(['CODE','FRUIT'])['PRD1994'].sum()
sort_the_fruit()
I used the 1st return of my code to get rid of the 30's. Then when I try to follow it by a .groupby() I get one of two responses:
pandas.core.groupby.DataFrameGroupBy object at 0x7fb888082be0
or
File "<ipython-input-46-266164a56eb1>", line 5
.groupby(['STNAME','CTYNAME'])
^
SyntaxError: invalid syntax
If I have line 2 uncommented, it will give me all the lines minus the 30's. Then if I uncomment line 3, I get the first response or if I uncomment line 6 I get the 2nd response.
Python recommended using .apply but I don't understand how to use .apply and I barely understand .agg to the effect of I might as well not know it exists. I tried using duplicates but it only cares about the 1st, last or all and I need it to care about 2.
I was trying to avoid iteration because of the large data set so I have asked Google so many questions and learned a lot along the way but I still have no idea how to do this. I need help, real bad, and I need you to explain each piece of your code a little so I can follow along.
Thank you for your time and patience, Don
Upvotes: 2
Views: 5332
Reputation: 59274
Not sure if that is what you want. But you can use nlargest()
and groupby
df[df.CODE != 30].groupby('FRUIT').PRD1994.nlargest(2) # same as .groupby.apply(lambda g: g.nlargest(2))
which yields
FRUIT
apples 2 123789
1 123456
grapes 17 1012356
18 1008586
oranges 11 132589
9 121578
peaches 22 658563
21 45789
pears 6 157478
7 102554
pineapples 15 489745
13 151564
Name: PRD1994, dtype: int64
Now, if you want the sum
of these pair of values, just use .sum()
in the groupby
. You would get:
df[df.CODE != 30].groupby('FRUIT').PRD1994.apply(lambda g: g.nlargest(2).sum())
FRUIT
apples 247245
grapes 2020942
oranges 254167
peaches 704352
pears 260032
pineapples 641309
Name: PRD1994, dtype: int64
If you want to get the nlargest
of this series, just use it again :)
df[df.CODE != 30].groupby('FRUIT').PRD1994.apply(lambda g: g.nlargest(2).sum()).nlargest(2)
FRUIT
grapes 2020942
peaches 704352
Name: PRD1994, dtype: int64
Of course, if you want just grapes
and peaches
as output, just get the index of this final pd.Series
and voila ;)
Upvotes: 3