Reputation: 166
I have the following dataframe where I want to assign the bottom 1% value to a new column. When I do this calculation with using the ".loc" notification, it takes around 10 seconds for using .loc assignment, where the alternative solution is only 2 seconds.
df_temp = pd.DataFrame(np.random.randn(100000000,1),columns=list('A'))
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)
Why is the .loc solution slower? I understand using the .loc solution is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?
Upvotes: 5
Views: 2261
Reputation: 51335
.loc
is searching along the entirety of indices and columns (in this case, only 1 column) in your df along the whole axes, which is time consuming and perhaps redundant, in addition to figuring out the quantiles of df_temp['A']
(which is negligible as far as calculation time). Your direct assignment method, on the other hand, is just parsing df_temp['A'].quantile(0.01)
, and assigning df_temp['q']
. It doesn't need to exhaustively search the indices/columns of your df.
See this answer for a similar description of the .loc
method.
As far as safety is concerned, you are not using chained indexing, so you're probably safe (you're not trying to set anything on a copy of your data, it's being set directly on the data itself). It's good to be aware of the potential issues with not using .loc
(see this post for a nice overview of SettingWithCopy
warnings), but I think that you're OK as far as that goes.
If you want to be more explicit about your column creation, you could do something along the lines of df = df.assign(q=df_temp["A"].quantile(0.01))
. It won't really change performance (I don't think), nor the result, but it allows you to see that you're explicitly assigning a new column to your existing dataframe (and thus not setting anything on a copy of said dataframe).
Upvotes: 3