Reputation: 229
I'm trying to calculate group averages inside a cross-validation scheme, but this iterative method is extremely slow because my dataframe contains more than 1 million rows. Is it possible to vectorize this calculation? Thanks.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation has been removed

data = np.column_stack([np.arange(1, 101),
                        np.random.randint(1, 11, 100),
                        np.random.randint(1, 101, 100)])
df = pd.DataFrame(data, columns=['id', 'group', 'total'])

kf = KFold(n_splits=3, shuffle=True)
f = {'total': ['mean']}
df['fold'] = 0
df['group_average'] = 0.0  # float column, since it will hold means

for train_index, test_index in kf.split(df):
    df.loc[train_index, 'fold'] = 0
    df.loc[test_index, 'fold'] = 1
    # per-group means computed on the training fold only
    aux = df.loc[df.fold == 0, :].groupby(['group'])
    aux2 = aux.agg(f)
    aux2.reset_index(inplace=True)
    aux2.columns = ['group', 'group_average']
    # slow part: row-by-row lookup of each test row's group mean
    for i, row in df.loc[df.fold == 1, :].iterrows():
        new = aux2.loc[aux2.group == row.group, 'group_average']
        if new.empty:
            new = 0
        else:
            new = new.values[0]
        df.loc[i, 'group_average'] = new
Upvotes: 3
Views: 531
Reputation: 3212
Replace the for i, row in df.loc[df.fold == 1, :].iterrows(): loop with this:
df0 = pd.merge(df[df.fold == 1], aux2, on='group').set_index('id')
df = df.set_index('id')
df.loc[df.fold == 1, 'group_average'] = df0['group_average_y']
df = df.reset_index()
This gives me the same result as your code and is almost 7 times faster.
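For what it's worth, the merge can be avoided entirely: compute the per-group means of the training fold once with groupby, then map each test row's group onto that Series with Series.map, which is vectorized. fillna(0) keeps the original code's behaviour of defaulting to 0 for groups that never occur in the training fold. A minimal sketch (the seed and random_state are only there for reproducibility, not part of the original code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'id': np.arange(1, 101),
    'group': rng.integers(1, 11, 100),
    'total': rng.integers(1, 101, 100),
})
df['group_average'] = 0.0  # float column to hold the means

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_index, test_index in kf.split(df):
    # per-group mean on the training fold only
    means = df.iloc[train_index].groupby('group')['total'].mean()
    # vectorized lookup for every test row at once;
    # groups unseen in the training fold map to NaN, so fill with 0
    df.loc[df.index[test_index], 'group_average'] = (
        df.iloc[test_index]['group'].map(means).fillna(0).values
    )
```

This does one groupby and one map per fold instead of one lookup per test row, so its cost grows with the number of folds, not the number of rows in each test fold.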
Upvotes: 3