Oria Gruber

Reputation: 1533

Looking for a faster way to iterate over a pandas dataframe

I have a pandas dataframe called df_ratings with about a million rows and 3 columns.

I want to take the data inside this dataframe, apply a transformation to it, and put it into a numpy matrix called ratings_matrix.

I wrote the following code to achieve this:

for i in range(df_ratings.shape[0]):  # fill matrix with ratings. zero = unrated
    current_user = df_ratings.iloc[i, 0] - 1                        # user id column, shifted to 0-based
    current_movie = rated_movies_dictionary[df_ratings.iloc[i, 1]]  # map movie id to matrix row
    current_rating = df_ratings.iloc[i, 2]                          # rating column

    ratings_matrix[current_movie, current_user] = current_rating

It works, but it is very slow: iterating over every row of the dataframe with .iloc in a Python for loop takes a long time. Is there a faster way to do this?

Upvotes: 2

Views: 2601

Answers (1)

piRSquared

Reputation: 294506

cuser = df_ratings.iloc[:, 0].values - 1                           # user column -> 0-based numpy array
cmvie = df_ratings.iloc[:, 1].map(rated_movies_dictionary).values  # movie ids -> matrix row indices
crate = df_ratings.iloc[:, 2].values                               # ratings as a numpy array
ratings_matrix[cmvie, cuser] = crate                               # one vectorized assignment, no Python loop
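
For context, here is a minimal end-to-end sketch of the same vectorized assignment. The column names, the dictionary, the matrix shape, and the toy data are assumptions for illustration only, not the asker's actual data:

import numpy as np
import pandas as pd

# Toy stand-ins for the asker's data -- names and values are hypothetical.
df_ratings = pd.DataFrame({
    'user_id':  [1, 2, 2, 3],          # 1-based user ids
    'movie_id': [10, 10, 20, 30],      # arbitrary movie ids
    'rating':   [4.0, 5.0, 3.0, 2.0],
})
rated_movies_dictionary = {10: 0, 20: 1, 30: 2}  # movie id -> matrix row index
ratings_matrix = np.zeros((3, 3))                # movies x users, zero = unrated

# Vectorized version of the loop: build the index arrays once, assign in one shot.
cuser = df_ratings.iloc[:, 0].values - 1
cmvie = df_ratings.iloc[:, 1].map(rated_movies_dictionary).values
crate = df_ratings.iloc[:, 2].values
ratings_matrix[cmvie, cuser] = crate

print(ratings_matrix)
# [[4. 5. 0.]
#  [0. 3. 0.]
#  [0. 0. 2.]]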

Response to comment

does the .values add something? – Maarten Fabré

Yes! For this kind of bulk operation it is often more performant to work with numpy arrays. Since the final goal is a slice assignment, I wanted to get everything into numpy arrays first. As a simple demonstration, I've run timeit while slicing with a pandas Series and with the numpy array taken from that Series.

%timeit np.arange(4)[pd.Series([1, 2, 3])]
%timeit np.arange(4)[pd.Series([1, 2, 3]).values]

111 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
61.1 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Upvotes: 4
