Reputation: 1533
I have a pandas dataframe called df_ratings
with about a million rows and 3 columns.
I want to take the data inside this dataframe, apply a transformation to it, and put it inside a numpy matrix called ratings_matrix.
I wrote the following code to achieve this:
for i in range(df_ratings.shape[0]):  # fill matrix with ratings. zero = unrated
    current_user = df_ratings.iloc[i, 0] - 1
    current_movie = rated_movies_dictionary[df_ratings.iloc[i, 1]]
    current_rating = df_ratings.iloc[i, 2]
    ratings_matrix[current_movie, current_user] = current_rating
It works, but very slowly: iterating over every row of the dataframe in a Python for loop is slow. Is there a faster way to do this?
Upvotes: 2
Views: 2601
Reputation: 294506
cuser = df_ratings.iloc[:, 0].values - 1
cmvie = df_ratings.iloc[:, 1].map(rated_movies_dictionary).values
crate = df_ratings.iloc[:, 2].values
ratings_matrix[cmvie, cuser] = crate
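To make the idea concrete, here is a minimal, self-contained sketch of the same vectorized fill on a tiny invented dataset (the column values, matrix shape, and dictionary entries below are made up for illustration; the real df_ratings, rated_movies_dictionary, and ratings_matrix come from your code):

```python
import numpy as np
import pandas as pd

# Hypothetical small ratings table mirroring df_ratings:
# column 0 = user id (1-based), column 1 = movie id, column 2 = rating.
df_ratings = pd.DataFrame({
    "user":   [1, 2, 1],
    "movie":  [10, 20, 20],
    "rating": [4.0, 5.0, 3.0],
})
rated_movies_dictionary = {10: 0, 20: 1}   # movie id -> matrix row

ratings_matrix = np.zeros((2, 2))          # movies x users, zero = unrated

# Vectorized version: build three aligned index/value arrays, then do
# a single fancy-indexed assignment instead of a Python-level loop.
cuser = df_ratings.iloc[:, 0].values - 1
cmvie = df_ratings.iloc[:, 1].map(rated_movies_dictionary).values
crate = df_ratings.iloc[:, 2].values
ratings_matrix[cmvie, cuser] = crate
```

Each position of (cmvie, cuser) picks out one cell, so the assignment places every rating in one shot.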
Response to comment
does the .values add something? – Maarten Fabré
Yes! When performing many operations, it is often faster to work with numpy arrays directly, since indexing with a pandas series carries extra overhead. Because the final goal is a slice assignment, I wanted everything in numpy arrays first. As a simple demonstration, I've run timeit while slicing with a pandas series and with the numpy array extracted from that series:
%timeit np.arange(4)[pd.Series([1, 2, 3])]
%timeit np.arange(4)[pd.Series([1, 2, 3]).values]
111 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
61.1 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
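The two forms select the same elements, so swapping in .values only changes the speed, not the result. A quick check (with made-up example values):

```python
import numpy as np
import pandas as pd

arr = np.arange(4)
idx = pd.Series([1, 2, 3])

# Indexing with the series and with its underlying numpy array
# picks out the same elements; .values just skips pandas overhead.
from_series = arr[idx]
from_array = arr[idx.values]
```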
Upvotes: 4