Reputation: 13
I have the following pandas dataframe called 'production' and would like to obtain the weight of every plant's produce conditional on its species, greenhouse and year_month
plant species greenhouse produce year_month
0001 S1 GH1 200 2020-05
0002 S1 GH1 200 2020-05
0003 S2 GH1 100 2020-05
0004 S2 GH1 50 2020-05
0005 S1 GH2 90 2020-05
0006 S2 GH2 60 2020-05
0007 S1 GH1 150 2020-04
0008 S1 GH2 250 2020-04
0009 S1 GH2 100 2020-04
0010 S2 GH2 150 2020-04
0011 S2 GH2 190 2020-04
0012 S2 GH2 10 2020-04
For example, for '2020-05' we have GH1 and GH2. In GH1, S1 (plant 0001 and 0002) had total produce of 400 (200 + 200), so plant 0001 was responsible for a weight of 0.50 and plant 0002 for 0.50. S2 had total produce of 150 (100 + 50); so to plant 0003 we attribute a weight of 0.66 (100 / 150), and to plant 0004 a weight of 0.33 (50 / 150). And so on.
I would like to generate these 'conditional group-by' weights and assign them as a new column to the dataframe, such that the result looks like so:
plant species greenhouse produce year_month contribution_weight
0001 S1 GH1 200 2020-05 0.50
0002 S1 GH1 200 2020-05 0.50
0003 S2 GH1 100 2020-05 0.66
0004 S2 GH1 50 2020-05 0.33
0005 S1 GH2 90 2020-05 1.00
0006 S2 GH2 60 2020-05 1.00
0007 S1 GH1 150 2020-04 1.00
0008 S1 GH2 250 2020-04 0.71
0009 S1 GH2 100 2020-04 0.29
0010 S2 GH2 150 2020-04 0.42
0011 S2 GH2 190 2020-04 0.54
0012 S2 GH2 10 2020-04 0.02
How can I add the contribution_weight column programmatically?
I have tried using pandas to manually compute each weight, but this is very tedious, prone to error and generates a lot of sub datasets. This is live data, so next month there would be another batch with year_month '2020-06' and there could also be additional GH and species, so I'm looking for a solution that is general and abstract enough to work even with additional labels in each of the columns. Perhaps some function would work?
Upvotes: 1
Views: 72
Reputation: 323226
We can do transform
s=df.groupby(['year_month','greenhouse','species']).produce.transform('sum')
df['New']=df.produce/s
df
plant species greenhouse produce year_month New
0 1 S1 GH1 200 2020-05 0.500000
1 2 S1 GH1 200 2020-05 0.500000
2 3 S2 GH1 100 2020-05 0.666667
3 4 S2 GH1 50 2020-05 0.333333
4 5 S1 GH2 90 2020-05 1.000000
5 6 S2 GH2 60 2020-05 1.000000
6 7 S1 GH1 150 2020-04 1.000000
7 8 S1 GH2 250 2020-04 0.714286
8 9 S1 GH2 100 2020-04 0.285714
9 10 S2 GH2 150 2020-04 0.428571
10 11 S2 GH2 190 2020-04 0.542857
11 12 S2 GH2 10 2020-04 0.028571
Upvotes: 1