Samet

Reputation: 11

How to use vectorization instead of for loop in pandas

I'm trying to build a machine learning algorithm for my job. The data I'm using for training and testing has 17k rows and 20 columns. I tried adding a new column computed from two other columns, but the for loop I wrote is too slow (it takes 3 seconds to run):

for i in range(0, len(model_olculeri)):
    if (model_olculeri["Bel"][i] != 0) and (model_olculeri["Basen"][i] != 0):
        sum_column = (model_olculeri["Bel"][i]) / (model_olculeri["Basen"][i])
        model_olculeri["Waist to Hip Ratio"][i] = sum_column

I've read articles about using pandas and NumPy vectorization instead of for loops on pandas DataFrames, and it seems to be much faster and more efficient. How can I vectorize my for loop? Thanks a lot.

Upvotes: 0

Views: 511

Answers (2)

user16386186

Reputation:

Chained solution using query and pipe

# The column name contains spaces, so it is passed to assign via dict unpacking.
model_olculeri.query("Bel != 0 & Basen != 0").pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x.Bel / x.Basen}))
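
One thing worth checking with the chained approach is that query returns only the rows that pass the filter, so the result is shorter than the original frame. Here is a minimal sketch on a made-up three-row frame (the values are assumptions for illustration, not the asker's data):

```python
import pandas as pd

# Toy stand-in for model_olculeri; the values are invented for illustration.
model_olculeri = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0],
    "Basen": [100.0, 95.0, 0.0],
})

# Keep rows where both columns are non-zero, then add the ratio column.
# The lambda receives the already-filtered frame.
result = (
    model_olculeri
    .query("Bel != 0 & Basen != 0")
    .assign(**{"Waist to Hip Ratio": lambda x: x["Bel"] / x["Basen"]})
)

print(result)
```

Only the first row survives the filter here, and the original frame is left unchanged. If every row must be kept, a mask-based assignment is probably the better fit.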

Upvotes: 0

jezrael

Reputation: 863611

Create a Boolean mask and use it for filtering:

m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m,"Basen"]

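Alternative (the right-hand side is computed for every row, but only the rows where the mask is True are assigned):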

model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri["Bel"] / model_olculeri["Basen"]

Or set the new values with numpy.where:

import numpy as np

model_olculeri["Waist to Hip Ratio"] = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)
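
To see the mask and numpy.where variants side by side, here is a small self-contained sketch. The four-row frame is a made-up stand-in for model_olculeri, and ratio_where is a hypothetical name used only for the comparison:

```python
import numpy as np
import pandas as pd

# Toy stand-in for model_olculeri; the values are invented for illustration.
model_olculeri = pd.DataFrame({
    "Bel": [70.0, 0.0, 80.0, 90.0],
    "Basen": [100.0, 95.0, 0.0, 120.0],
})

# True only where both measurements are non-zero.
m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)

# Assign the ratio on the masked rows; the new column is NaN elsewhere.
model_olculeri.loc[m, "Waist to Hip Ratio"] = (
    model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m, "Basen"]
)

# Same values via numpy.where: the division runs on every row,
# and rows where the mask is False are replaced with NaN.
ratio_where = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)

print(model_olculeri)
```

Both variants should give the same column: the ratio where both inputs are non-zero and NaN otherwise. All rows of the frame are kept.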

Upvotes: 1
