Optimization - Dataframe aggregation will different filters during aggregation: df.loc or not?

Question

I'm looking to the aggregation below - ideally in a single step. Aggregated columns need to be computed with different filters and I thought of two ways to achieve this (please see functions f1 and f2). I thought that defining an index (as in f2) would speed up the process but it did exactly the opposite - the aggregation takes about 2-3 times longer, irrespective of the dataframe number of rows.

Why is this happening? I thought .loc was the recommended method. Also, is there a third (and faster than f1) method ? I'm using Python 3.6.4.

import numpy as np
import pandas as pd
from collections import OrderedDict
import time

N = 10**5
df_big = pd.DataFrame({'grp': np.array(list(range(1,11)) * N),
                       'vals': np.random.randint(0,100, 10*N),
                       'var1': np.random.randint(10,30, 10*N)})

def f1(x):
    d = OrderedDict()
    d['vals_sum_1'] = np.sum(x['vals'][x['var1'] > 15])
    d['vals_mean_1'] = np.mean(x['vals'][x['var1'] > 15])
    d['vals_median_1'] = np.median(x['vals'][x['var1'] > 15])
    d['vals_sum_2'] = np.sum(x['vals'][x['var1'] > 20])
    d['vals_mean_2'] = np.mean(x['vals'][x['var1'] > 20])
    d['vals_median_2'] = np.median(x['vals'][x['var1'] > 20])    
    return pd.Series(d)

def f2(x):
    d = OrderedDict()
    idx1 = x.loc[x['var1'] > 15].index
    idx2 = x.loc[x['var1'] > 20].index
    d['vals_sum_1'] = np.sum(x['vals'][idx1])
    d['vals_mean_1'] = np.mean(x['vals'][idx1])
    d['vals_median_1'] = np.median(x['vals'][idx1])
    d['vals_sum_2'] = np.sum(x['vals'][idx2])
    d['vals_mean_2'] = np.mean(x['vals'][idx2])
    d['vals_median_2'] = np.median(x['vals'][idx2])   
    return pd.Series(d)  

start_time = time.time()
df_grp_1 = df_big.groupby('grp').apply(f1).reset_index()
gr1_time = time.time()
df_grp_2 = df_big.groupby('grp').apply(f2).reset_index()
gr2_time = time.time()

print("Using aggf1: %s seconds ---" % (gr1_time - start_time))
print("Using aggf2: %s seconds ---" % (gr2_time - gr1_time))

jpp · Accepted Answer

There are many repeated operations. You can see a ~2x factor improvement by removing repeated indexing:

def f3(df):

    g1 = df.loc[df['var1'] > 15].groupby('grp')['vals']
    g2 = df.loc[df['var1'] > 20].groupby('grp')['vals']

    res = pd.DataFrame({'grp': df['grp'].unique()})

    for i, j in enumerate([g1, g2], 1):
        res['vals_sum_'+str(i)] = res['grp'].map(j.sum())
        res['vals_mean_'+str(i)] = res['grp'].map(j.mean())
        res['vals_median_'+str(i)] = res['grp'].map(j.median())

    return res

%timeit df_big.groupby('grp').apply(f1).reset_index()  # 349ms
%timeit df_big.groupby('grp').apply(f2).reset_index()  # 433ms
%timeit f3(df_big)                                     # 183ms

Optimization - Dataframe aggregation will different filters during aggregation: df.loc or not?

Answers (2)

Related Questions