Reputation: 699
I have a dataframe like the following:
Label | Indicator | Value1 | Value2 |
---|---|---|---|
A | 77 | 50 | 50 |
A | 776 | 60 | 70 |
A | 771 | 70 | 40 |
A | 7 | 80 | 50 |
A | 7775 | 90 | 40 |
B | 776 | 100 | 40 |
B | 771 | 41 | 50 |
B | 775 | 54 | 40 |
B | 7775 | 55 | 50 |
What I want is an output like that:
Label | aggregation1 | aggregation2 |
---|---|---|
A | aggregation1_A_value | aggregation2_A_value |
B | aggregation1_B_value | aggregation2_B_value |
Knowing that the way I want to aggregate value is the following (example):
aggregation1 = value1 of indicator starting with 77 (but not 776) - value2 of indicator 776 and 775.
What I am doing now is the following: I split the Indicator into several columns, to have a new data frame:
Label | Indicator0 | Indicator1 | Indicator2 | ... |
---|---|---|---|---|
A | 7 | 77 | 77 | ... |
A | 7 | 77 | 776 | ... |
A | 7 | 77 | 771 | ... |
... | ... | ... | ... | ... |
B | 7 | 77 | 777 | ... |
aggregation1_A = df.query("Label=='A' and Indicator1 is in ["77"] and Indicator2 is not in ["776"]")["value1"].sum()
aggregation1_A -= df.query("Label=='A' and Indicator2 is in ["776","775"]")["value2"].sum()
My issue is that I have more than 70 000 differents labels, and about 20 aggregations to run. Dataframe is 500MB large.
I am wondering if there is any better way. I had a look with pandas UDF and apply a custom aggregation function but I didn't succeed so far.
Thank you for your help
Upvotes: 0
Views: 56
Reputation: 101
You can use get_dummies to replace the step where you split your indicator separate columns. Then you can use those bool values to carry out your aggregations:
dummies = pd.get_dummies(df, columns=['Indicator'])
def agg_1(df):
ret = df.apply(lambda x: x['Value1']*x[['Indicator_77','Indicator_771', 'Indicator_7775']], axis=1).sum().sum()
ret -= df.apply(lambda x: x['Value2']*x[['Indicator_775', 'Indicator_776']], axis=1).sum().sum()
return ret
dummies.groupby('Label').apply([agg_1])
The lambda functions are just multiplying the values by whether or not the relevant indicators are in that row. The sum().sum()
flattens the result of that multiplication into a scalar.
You can put all your aggregation functions in the list with agg_1
.
Upvotes: 1