Vectorizing a df.apply() Operation in Pandas

Question

I have a (493,20) pandas dataframe and want to compute a conditional np.nanmean() for each row. The condition is that each value in the row needs to be above a certain threshold and below another. Here's my current setup:

filt_avg_data = np.nanmean(data_tsl.apply(func=lambda x: x[(x < maxval * np.median(x)) & (x > minval * np.median(x))], axis=1), axis=1)

where maxval = 10, minval = 0.1, and data_tsl.shape = (493, 20). This works okay.

However, I want to vectorize this operation; I don't want to use df.apply(). I tried

data_tsl > np.median(data_tsl, axis=1)

to create a mask of values on which I can perform np.nanmean(), but I can't get each row of data_tsl to line up with its respective median value. This is the error that pops up: ValueError: operands could not be broadcast together with shapes (493,2) (493,)

How can I vectorize this operation? The similar questions I found weren't actually about vectorizing the problem; they only asked how to get the .apply() operation to work.

Divakar · Accepted Answer

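If the input data contains NaNs, you probably want np.nanmedian so that NaNs are ignored in the median calculation. The broadcasting error arises because the per-row median has shape (493,), which does not line up with the rows of a (493, 20) array; indexing it with [:, None] turns it into a (493, 1) column that does. With that, we can build a combined mask for the upper and lower thresholds, set the values outside them to NaN as well, and finally use np.nanmean: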

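import numpy as np

# .values can share memory with data_tsl, so the masking below may overwrite the
# DataFrame; use data_tsl.values.copy() to avoid editing the input df
a = data_tsl.values
med = np.nanmedian(a, axis=1)  # per-row median, ignoring NaNs
U = maxval * med  # per-row upper threshold
L = minval * med  # per-row lower threshold

# [:, None] adds a trailing axis so each row is compared with its own threshold
a[(a >= U[:, None]) | (a <= L[:, None])] = np.nan
out = np.nanmean(a, axis=1)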
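
As a sanity check, here is a minimal, self-contained sketch of the same idea that leaves the DataFrame untouched and compares the result against a row-wise apply. The synthetic data, the random seed, and the names keep and ref are assumptions made for illustration; they are not from the question. It uses keepdims=True, which should be equivalent to the [:, None] indexing above, and np.where instead of in-place assignment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_tsl (assumption: positive floats with a few NaNs).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=0.0, sigma=2.0, size=(493, 20))
values[rng.random(values.shape) < 0.05] = np.nan
data_tsl = pd.DataFrame(values)
maxval, minval = 10, 0.1

a = data_tsl.to_numpy(dtype=float)

# keepdims=True keeps the reduced axis, so med has shape (493, 1) and
# broadcasts against the (493, 20) array row by row.
med = np.nanmedian(a, axis=1, keepdims=True)

# Comparisons involving NaN evaluate to False, so NaNs never enter the mask.
with np.errstate(invalid="ignore"):
    keep = (a < maxval * med) & (a > minval * med)

# np.where builds a new array instead of writing into a.
out = np.nanmean(np.where(keep, a, np.nan), axis=1)

# Row-wise reference: the question's logic, but with np.nanmedian.
ref = data_tsl.apply(
    lambda x: x[(x < maxval * np.nanmedian(x)) & (x > minval * np.nanmedian(x))].mean(),
    axis=1,
).to_numpy()

assert np.allclose(out, ref, equal_nan=True)
```

A row in which no value passes the filter would yield NaN and a "Mean of empty slice" RuntimeWarning from np.nanmean; whether that is acceptable depends on the data.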
