Reputation: 8628
I want to optimize my code in terms of execution time. The code runs on the dataframe alldata, which contains around 300,000 entries, and the computation takes a very long time (around 10 hours).
The logic of the computation is as follows: for each missing (NaN) value in the dataframe columns listed in list_of_NA_features, the function fill_missing_values searches for the most similar row (cosine similarity is computed on the columns in list_of_non_nan_features, which are never empty) and returns that row's value for the current column.
import pandas as pd
from scipy import spatial

def fill_missing_values(param_nan, current_row, df):
    # Keep only the rows where the target column is not NaN.
    df_non_nan = df.dropna(subset=[param_nan])
    list_of_non_nan_features = ["f1", "f2", "f3", "f4", "f5"]
    max_val = 0
    searched_val = 0
    vector1 = current_row[list_of_non_nan_features].values
    for index, row in df_non_nan.iterrows():
        vector2 = row[list_of_non_nan_features].values
        sim = 1 - spatial.distance.cosine(vector1, vector2)
        if sim > max_val:
            max_val = sim
            searched_val = row[param_nan]
    return searched_val

list_of_NA_features = df_train.columns[df_train.isnull().any()]

for feature in list_of_NA_features:
    for index, row in alldata.iterrows():
        if pd.isnull(row[feature]):
            missing_value = fill_missing_values(feature, row, alldata)
            alldata.loc[index, feature] = missing_value
Is it possible to optimize this code? For instance, I am thinking about replacing the for loops with lambda functions. Is that possible?
Upvotes: 0
Views: 1316
Reputation: 498
Instead of replacing your for loops with lambdas, try replacing them with ufuncs.
Losing Your Loops: Fast Numerical Computation with NumPy is an excellent talk by Jake VanderPlas on the subject. Using universal functions and broadcasting instead of for loops can dramatically improve the speed of your code.
Here is a basic example:
import numpy as np
from time import time

def timed(func):
    # Simple decorator that reports how long a call takes.
    def inner(*args, **kwargs):
        t0 = time()
        result = func(*args, **kwargs)
        elapsed = time() - t0
        print(f'ran {func.__name__} in {elapsed} seconds')
        return result
    return inner

# without broadcasting:
@timed
def sums():
    sums = np.zeros([500, 500])
    for a in range(500):
        for b in range(500):
            sums[a, b] = a + b
    return sums

# with broadcasting:
@timed
def sums_broadcasted():
    a = np.arange(500)
    b = np.reshape(np.arange(500), [500, 1])
    return a + b
INPUT:
a = sums()
b = sums_broadcasted()
assert (a == b).all()
OUTPUT:
ran sums in 0.030008554458618164 seconds
ran sums_broadcasted in 0.0005011558532714844 seconds
Note that by eliminating the loops we get a roughly 60x speedup!
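The same idea can be applied one level up in your own code: instead of iterating over rows and calling spatial.distance.cosine once per pair, scipy.spatial.distance.cdist can compute all pairwise cosine distances in a single vectorized call, and argmin then picks the nearest row. The sketch below is only an outline under the assumptions stated in your question (the feature columns f1–f5 are numeric and never NaN); the helper name fill_missing_values_vectorized is mine, not a drop-in replacement, and for ~300,000 rows you may need to process the missing rows in chunks so the distance matrix fits in memory.

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def fill_missing_values_vectorized(df, target_col, feature_cols):
    # Rows that need a value vs. rows that already have one.
    missing = df[target_col].isnull()
    known = ~missing
    if not missing.any() or not known.any():
        return df

    # All pairwise cosine distances between missing rows and known rows,
    # computed in one call instead of a Python loop.
    dist = cdist(df.loc[missing, feature_cols].values,
                 df.loc[known, feature_cols].values,
                 metric='cosine')

    # For each missing row, the position of its nearest known row.
    nearest = dist.argmin(axis=1)
    df.loc[missing, target_col] = df.loc[known, target_col].values[nearest]
    return df

# hypothetical usage on the dataframe from the question:
for feature in list_of_NA_features:
    alldata = fill_missing_values_vectorized(alldata, feature,
                                             ["f1", "f2", "f3", "f4", "f5"])

This replaces the per-row Python loop with a single NumPy-level computation per column, which is where the bulk of the 10 hours is currently being spent.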
Upvotes: 1