Derek Krantz
Derek Krantz

Reputation: 487

vectorize index lookup in pandas

I'm looping thru a dataset trying to find values in an index. If the value isn't in the (multi-level) index, then I take a default value:

for i,row in df.iterrows():
    if i in avg_days.index:
        df.at[i,'avg_days_to_n'] = round(avg_days.xs(i,axis=0)[1],0)
    else:
        df.at[i,'avg_days_to_n'] = round(avg_days.xs((i[0],i[1]),level (0,1),drop_level=False).avg_days_between_n.mean(),0)

is there a way to vectorize this?

Here is what the avg_days dataframe looks like:

                            avg_n
state   project_type    cat 
FL      R               Hur 33
CA      C               EE  43

Here is what df looks like:

                            id
state   project_type    cat 
FL      R               Hur abc-1
CA      C               EE  def-2

So, for each row in df, if the index state-project_type-cat has an index match in avg_days, I return the avg_n value.

If not, i "move up" the index and take the state-project type (and NOT cat) value as i know that will return something

Upvotes: 1

Views: 1268

Answers (1)

OriolAbril
OriolAbril

Reputation: 8813

The first part, matching all 3 indexes at the same time can be easily achieved with the merge function, which is implemented in pandas. In order to maintain all the indexes found in df even when they are not present in avg_days, the how argument must be set to left:

merged_df = pd.merge(df,avg_days,right_index=True,left_index=True,how='left')

The second part is trickier than that. After the merge, there will be indexes without any value in the avg_n value of the merged_df dataframe, which will be the ones that have to be set to the avg_n mean of the first 2 indexes.

The mean values can be calculated with the groupby function, and then the indexes of the list that need to be corrected can be obtained checking for NaN.

days_grouped = np.around(avg_days.groupby(['state','project_type']).mean())
nan_mask = ~np.isfinite(merged_df['avg_n'])
indexes2correct = [(s,p) for s,p,c in merged_df[nan_mask].index.values]

Finally, the rows containing NaN can be updated using the values in days_gruped:

merged_df.loc[nan_mask,'avg_n'] = days_grouped.loc[indexes2correct]['avg_n'].values 

Note: I compared the results obtained with the code shown in the question and this way of vectorizing, and there are some differences that should be taken into account:

  • Here, only the means are rounded, and the values in avg_days['avg_n'] are assumed to be integers already.
  • The rounding is done with numpy.round instead of the built-in round, which in some cases round to different values (see this other question for more information).

Upvotes: 2

Related Questions