Reputation: 487
I'm looping thru a dataset trying to find values in an index. If the value isn't in the (multi-level) index, then I take a default value:
for i,row in df.iterrows():
if i in avg_days.index:
df.at[i,'avg_days_to_n'] = round(avg_days.xs(i,axis=0)[1],0)
else:
df.at[i,'avg_days_to_n'] = round(avg_days.xs((i[0],i[1]),level (0,1),drop_level=False).avg_days_between_n.mean(),0)
is there a way to vectorize this?
Here is what the avg_days dataframe looks like:
avg_n
state project_type cat
FL R Hur 33
CA C EE 43
Here is what df looks like:
id
state project_type cat
FL R Hur abc-1
CA C EE def-2
So, for each row in df, if the index state-project_type-cat has an index match in avg_days, I return the avg_n value.
If not, i "move up" the index and take the state-project type (and NOT cat) value as i know that will return something
Upvotes: 1
Views: 1268
Reputation: 8813
The first part, matching all 3 indexes at the same time can be easily achieved with the merge function, which is implemented in pandas. In order to maintain all the indexes found in df
even when they are not present in avg_days
, the how
argument must be set to left
:
merged_df = pd.merge(df,avg_days,right_index=True,left_index=True,how='left')
The second part is trickier than that. After the merge, there will be indexes without any value in the avg_n
value of the merged_df
dataframe, which will be the ones that have to be set to the avg_n
mean of the first 2 indexes.
The mean values can be calculated with the groupby function, and then the indexes of the list that need to be corrected can be obtained checking for NaN
.
days_grouped = np.around(avg_days.groupby(['state','project_type']).mean())
nan_mask = ~np.isfinite(merged_df['avg_n'])
indexes2correct = [(s,p) for s,p,c in merged_df[nan_mask].index.values]
Finally, the rows containing NaN
can be updated using the values in days_gruped
:
merged_df.loc[nan_mask,'avg_n'] = days_grouped.loc[indexes2correct]['avg_n'].values
Note: I compared the results obtained with the code shown in the question and this way of vectorizing, and there are some differences that should be taken into account:
avg_days['avg_n']
are assumed to be integers already.numpy.round
instead of the built-in round
, which in some cases round to different values (see this other question for more information). Upvotes: 2