Evgenii Nikitin

Reputation: 229

Vectorizing calculations in pandas

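I'm trying to calculate group averages inside a cross-validation scheme: for each fold, every test row should get the mean of 'total' for its group, computed on the training rows only. The row-by-row approach below is extremely slow because my real DataFrame contains more than 1 million rows. Is it possible to vectorize this calculation? Thanks.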

import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
from sklearn.model_selection import KFold

# Toy data: 100 rows with an id, a group label (1-10) and a total (1-100)
data = np.column_stack([np.arange(1, 101), np.random.randint(1, 11, 100), np.random.randint(1, 101, 100)])
df = pd.DataFrame(data, columns=['id', 'group', 'total'])

kf = KFold(n_splits=3, shuffle=True)
f = {'total': ['mean']}
df['fold'] = 0
df['group_average'] = 0.0
for train_index, test_index in kf.split(df):
    # .ix was removed in pandas 1.0; .loc works here because df has the default RangeIndex
    df.loc[train_index, 'fold'] = 0
    df.loc[test_index, 'fold'] = 1
    # Group means computed on the training rows only
    aux = df.loc[df.fold == 0, :].groupby(['group'])
    aux2 = aux.agg(f)
    aux2.reset_index(inplace = True)
    aux2.columns = ['group', 'group_average']
    # Slow part: look up each test row's group mean one row at a time
    for i, row in df.loc[df.fold == 1, :].iterrows():
        new = aux2.loc[(aux2.group == row.group), 'group_average']
        if new.empty:
            new = 0
        else:
            new = new.values[0]
        df.loc[i, 'group_average'] = new

Upvotes: 3

Views: 531

Answers (1)

Khris

Reputation: 3212

Replace the inner loop (for i, row in df.loc[df.fold == 1, :].iterrows():) with this:

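# how='left' keeps test rows whose group does not occur in the training fold
df0 = pd.merge(df[df.fold == 1], aux2, on='group', how='left').set_index('id')
df = df.set_index('id')
# Both frames have a 'group_average' column, so the merge suffixes them _x and _y;
# fillna(0) reproduces the original fallback of 0 for unseen groups
df.loc[(df.fold == 1), 'group_average'] = df0.loc[:, 'group_average_y'].fillna(0)
df = df.reset_index()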

This gives me the same result as your code and is almost 7 times faster.
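The merge still runs once per fold. As a rough sketch of a version with no Python-level loop, the training-fold group mean for a row can be obtained by subtracting that row's own-fold sum and count from the group-wide sum and count. This relies on every row being a test row in exactly one fold, which holds for a k-fold split. The function name out_of_fold_group_mean, its parameters, and the round-robin fold assignment (a stand-in for KFold with shuffle=True) are assumptions made for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd


def out_of_fold_group_mean(df, n_splits=3, seed=0):
    """Return a copy of df with 'fold' and out-of-fold 'group_average' columns."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Hypothetical fold assignment: shuffle row positions, then deal them
    # round-robin into n_splits folds (stands in for a shuffled KFold).
    out['fold'] = rng.permutation(len(out)) % n_splits

    # Per-group sum and count over all rows, broadcast back to each row.
    by_group = out.groupby('group')['total']
    group_sum = by_group.transform('sum')
    group_cnt = by_group.transform('count')

    # The same statistics restricted to each row's own fold.
    by_group_fold = out.groupby(['group', 'fold'])['total']
    fold_sum = by_group_fold.transform('sum')
    fold_cnt = by_group_fold.transform('count')

    # Training statistics = all rows of the group minus the row's own fold.
    train_sum = group_sum - fold_sum
    train_cnt = group_cnt - fold_cnt

    # Groups absent from the training folds get 0, as in the original loop.
    safe_cnt = train_cnt.where(train_cnt > 0, 1)
    out['group_average'] = (train_sum / safe_cnt).where(train_cnt > 0, 0.0)
    return out
```

Here 'fold' holds the index of the fold in which the row is a test row, rather than a 0/1 flag that is overwritten on each iteration. The whole computation is four grouped transforms, so its cost should not grow with the number of folds the way a per-fold merge does.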

Upvotes: 3
