Reputation: 8628
I want to optimize my code in terms of execution time. The code runs on the dataframe alldata, which contains around 300,000 entries, and the computation takes a very long time (around 10 hours).
The logic of the computation is as follows: for each missing (NaN) value in the dataframe columns listed in list_of_NA_features, the function fill_missing_values searches for the most similar row (cosine similarity is computed on the columns in list_of_non_nan_features, which are never empty) and returns that row's value for the current column.
import pandas as pd
from scipy import spatial

def fill_missing_values(param_nan, current_row, df):
    # Keep only the rows where the target column is not NaN.
    df_non_nan = df.dropna(subset=[param_nan])
    list_of_non_nan_features = ["f1", "f2", "f3", "f4", "f5"]
    max_val = 0
    searched_val = 0
    vector1 = current_row[list_of_non_nan_features].values
    for index, row in df_non_nan.iterrows():
        vector2 = row[list_of_non_nan_features].values
        sim = 1 - spatial.distance.cosine(vector1, vector2)
        if sim > max_val:
            max_val = sim
            searched_val = row[param_nan]
    return searched_val

list_of_NA_features = df_train.columns[df_train.isnull().any()]

for feature in list_of_NA_features:
    for index, row in alldata.iterrows():
        if pd.isnull(row[feature]):
            missing_value = fill_missing_values(feature, row, alldata)
            alldata.loc[index, feature] = missing_value
Is it possible to optimize this code? For instance, I am thinking about replacing the for loops with lambda functions. Is that possible?
Upvotes: 0
Views: 1316
Reputation: 498
Instead of replacing your for loops with lambdas, try replacing them with ufuncs.
Losing Your Loops: Fast Numerical Computation with NumPy is an excellent talk by Jake VanderPlas on the subject. Using universal functions and broadcasting instead of for loops can dramatically improve the speed of your code.
Here is a basic example:
import numpy as np
from time import time

def timed(func):
    # Simple decorator that reports how long a call takes.
    def inner(*args, **kwargs):
        t0 = time()
        result = func(*args, **kwargs)
        elapsed = time() - t0
        print(f'ran {func.__name__} in {elapsed} seconds')
        return result
    return inner

# without broadcasting:
@timed
def sums():
    sums = np.zeros([500, 500])
    for a in range(500):
        for b in range(500):
            sums[a, b] = a + b
    return sums

# with broadcasting:
@timed
def sums_broadcasted():
    a = np.arange(500)
    b = np.reshape(np.arange(500), [500, 1])
    return a + b
INPUT:
a = sums()
b = sums_broadcasted()
assert (a == b).all()
OUTPUT:
ran sums in 0.030008554458618164 seconds
ran sums_broadcasted in 0.0005011558532714844 seconds
Note that by eliminating the loops we get a roughly 60x speedup!
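The same idea can be applied one level up in your own code: instead of iterating over rows and calling spatial.distance.cosine once per pair, scipy.spatial.distance.cdist can compute all pairwise cosine distances in a single vectorized call, and argmin then picks the nearest row. The sketch below is only an outline under the assumptions stated in your question (the feature columns f1–f5 are numeric and never NaN); the helper name fill_missing_values_vectorized is mine, not a drop-in replacement, and for ~300,000 rows you may need to process the missing rows in chunks so the distance matrix fits in memory.

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def fill_missing_values_vectorized(df, target_col, feature_cols):
    # Rows that need a value vs. rows that already have one.
    missing = df[target_col].isnull()
    known = ~missing
    if not missing.any() or not known.any():
        return df

    # All pairwise cosine distances between missing rows and known rows,
    # computed in one call instead of a Python loop.
    dist = cdist(df.loc[missing, feature_cols].values,
                 df.loc[known, feature_cols].values,
                 metric='cosine')

    # For each missing row, the position of its nearest known row.
    nearest = dist.argmin(axis=1)
    df.loc[missing, target_col] = df.loc[known, target_col].values[nearest]
    return df

# hypothetical usage on the dataframe from the question:
for feature in list_of_NA_features:
    alldata = fill_missing_values_vectorized(alldata, feature,
                                             ["f1", "f2", "f3", "f4", "f5"])

This replaces the per-row Python loop with a single NumPy-level computation per column, which is where the bulk of the 10 hours is currently being spent.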
Upvotes: 1