Reputation: 229
I'm trying to calculate group averages inside a cross-validation scheme, but this iterative method is extremely slow because my dataframe contains more than 1 million rows. Is it possible to vectorize this calculation? Thanks.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation has been removed

data = np.column_stack([np.arange(1, 101),
                        np.random.randint(1, 11, 100),
                        np.random.randint(1, 101, 100)])
df = pd.DataFrame(data, columns=['id', 'group', 'total'])

kf = KFold(n_splits=3, shuffle=True)
f = {'total': ['mean']}
df['fold'] = 0
df['group_average'] = 0.0  # float column, since it will hold means

for train_index, test_index in kf.split(df):
    df.loc[train_index, 'fold'] = 0
    df.loc[test_index, 'fold'] = 1
    # per-group means computed on the training fold only
    aux = df.loc[df.fold == 0, :].groupby(['group'])
    aux2 = aux.agg(f)
    aux2.reset_index(inplace=True)
    aux2.columns = ['group', 'group_average']
    # slow part: row-by-row lookup of each test row's group mean
    for i, row in df.loc[df.fold == 1, :].iterrows():
        new = aux2.loc[aux2.group == row.group, 'group_average']
        if new.empty:
            new = 0
        else:
            new = new.values[0]
        df.loc[i, 'group_average'] = new
Upvotes: 3
Views: 531
Reputation: 3212
Replace the for i, row in df.loc[df.fold == 1, :].iterrows(): loop with this:
df0 = pd.merge(df[df.fold == 1], aux2, on='group').set_index('id')
df = df.set_index('id')
df.loc[df.fold == 1, 'group_average'] = df0['group_average_y']
df = df.reset_index()
This gives me the same result as your code and is almost 7 times faster.
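For what it's worth, the merge can be avoided entirely: compute the per-group means of the training fold once with groupby, then map each test row's group onto that Series with Series.map, which is vectorized. fillna(0) keeps the original code's behaviour of defaulting to 0 for groups that never occur in the training fold. A minimal sketch (the seed and random_state are only there for reproducibility, not part of the original code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'id': np.arange(1, 101),
    'group': rng.integers(1, 11, 100),
    'total': rng.integers(1, 101, 100),
})
df['group_average'] = 0.0  # float column to hold the means

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_index, test_index in kf.split(df):
    # per-group mean on the training fold only
    means = df.iloc[train_index].groupby('group')['total'].mean()
    # vectorized lookup for every test row at once;
    # groups unseen in the training fold map to NaN, so fill with 0
    df.loc[df.index[test_index], 'group_average'] = (
        df.iloc[test_index]['group'].map(means).fillna(0).values
    )
```

This does one groupby and one map per fold instead of one lookup per test row, so its cost grows with the number of folds, not the number of rows in each test fold.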
Upvotes: 3