Pdeuxa
Pdeuxa

Reputation: 699

Aggregating and custom function pandas

I have a dataframe like the following:

Label Indicator Value1 Value2
A 77 50 50
A 776 60 70
A 771 70 40
A 7 80 50
A 7775 90 40
B 776 100 40
B 771 41 50
B 775 54 40
B 7775 55 50

What I want is an output like that:

Label aggregation1 aggregation2
A aggregation1_A_value aggregation2_A_value
B aggregation1_B_value aggregation2_B_value

Knowing that the way I want to aggregate value is the following (example):

aggregation1 = value1 of indicator starting with 77 (but not 776) - value2 of indicator 776 and 775.

What I am doing now is the following: I split the Indicator into several columns, to have a new data frame:

Label Indicator0 Indicator1 Indicator2 ...
A 7 77 77 ...
A 7 77 776 ...
A 7 77 771 ...
... ... ... ... ...
B 7 77 777 ...
aggregation1_A = df.query("Label=='A' and Indicator1 is in ["77"] and Indicator2 is not in ["776"]")["value1"].sum()
aggregation1_A -= df.query("Label=='A' and Indicator2 is in ["776","775"]")["value2"].sum()

My issue is that I have more than 70 000 differents labels, and about 20 aggregations to run. Dataframe is 500MB large.

I am wondering if there is any better way. I had a look with pandas UDF and apply a custom aggregation function but I didn't succeed so far.

Thank you for your help

Upvotes: 0

Views: 56

Answers (1)

Michael Jones
Michael Jones

Reputation: 101

You can use get_dummies to replace the step where you split your indicator separate columns. Then you can use those bool values to carry out your aggregations:

dummies = pd.get_dummies(df, columns=['Indicator'])

def agg_1(df):

    ret = df.apply(lambda x: x['Value1']*x[['Indicator_77','Indicator_771', 'Indicator_7775']], axis=1).sum().sum()
    ret -= df.apply(lambda x: x['Value2']*x[['Indicator_775', 'Indicator_776']], axis=1).sum().sum()
    return ret

dummies.groupby('Label').apply([agg_1])

The lambda functions are just multiplying the values by whether or not the relevant indicators are in that row. The sum().sum() flattens the result of that multiplication into a scalar.

You can put all your aggregation functions in the list with agg_1.

Upvotes: 1

Related Questions