Justin Klevs

Reputation: 651

Efficiently finding the low median over multiple columns

I am very new to Python, so this is probably a basic question. I found a few solutions online but could not find exactly what I was looking for. I am trying to find the "low median" across 3 columns of data: if only 2 of the 3 columns are populated, I would like to take the lower of the two values.

Here is what I have found so far:

df['median']=np.nanmedian(df[['val1','val2','val3']], axis=1)

The above was not a workable solution, as I could not find an argument to take the low median when there is an even number of values. I did, however, find a function that does what I am looking for:

statistics.median_low()
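
For reference, this is the "low median" behaviour I am after, illustrated with the standard library:

from statistics import median, median_low

median([3, 6])         # 4.5 -- the ordinary median averages the two middle values
median_low([3, 6])     # 3   -- the low median returns the lower of the two
median_low([4, 5, 8])  # 5   -- for an odd count it is simply the middle value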

However, I am unsure how to apply it over multiple columns without calculating each median row by row (i.e. a loop or an apply). Ideally, I would like a vectorized solution that calculates all of the medians at once. Thank you for your assistance.

Upvotes: 2

Views: 901

Answers (2)

Divakar

Reputation: 221614

A few optimizations are possible for the three-column case: sort each row, which pushes any NaNs to the end of the row, and then pick either the first or the second sorted column depending on whether a NaN is present. Slicing then does the choosing and gives the desired median_low value for each row.

Here's that assembled into a vectorized solution -

a = df.values                      # the question's frame holds just the three value columns
a_sorted = np.sort(a, 1)           # sort each row; NaNs are pushed to the last column
# last sorted value NaN -> only two values present -> take the lower (first) one,
# otherwise take the middle (second) value
df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
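
Note that a = df.values assumes the dataframe holds only the three value columns. If it carries other columns as well, slice the value columns out first (using the question's val1/val2/val3 names) -

a = df[['val1','val2','val3']].values  # restrict to the three value columns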

Runtime test

Approaches -

# Proposed in this post
def vectorized_app(df):
    a = df.values
    a_sorted = np.sort(a,1)
    df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
    return df

# @piRSquared's new soln
def vectorized_app2(df):
    v = np.sort(df.values, axis=1)
    n = np.count_nonzero(~np.isnan(v), axis=1)
    j = (n - 1) // 2
    i = np.arange(len(v))
    return df.assign(median_low=v[i, j])

# @piRSquared's old soln
from statistics import median_low
def apply_app(df):
    med = lambda x: median_low(x.dropna())
    return df.apply(med, 1)

Timings -

In [433]: # Setup input dataframe and set one per row as NaN
     ...: np.random.seed(0)
     ...: a = np.random.randint(0,9,(10000,3)).astype(float)
     ...: idx = np.random.randint(0,3,a.shape[0])
     ...: a[np.arange(a.shape[0]), idx] = np.nan
     ...: df = pd.DataFrame(a)
     ...: df.columns = ['val1','val2','val3']
     ...: 

In [435]: %timeit vectorized_app(df)
1000 loops, best of 3: 481 µs per loop

In [436]: %timeit vectorized_app2(df)
1000 loops, best of 3: 892 µs per loop

In [434]: %timeit apply_app(df)
1 loop, best of 3: 1.15 s per loop

Upvotes: 2

piRSquared

Reputation: 294488

Answer

This is a generalized solution that works for any number of columns.

I sort each row, count the non-null values, and use that count to work out where the median_low sits in each sorted row.

v = np.sort(df.values, axis=1)                  # NaNs end up at the right of each row
n = np.count_nonzero(~np.isnan(v), axis=1)      # number of non-null values per row
j = (n - 1) // 2                                # position of the low median in each sorted row
i = np.arange(len(v))

df.assign(median_low=v[i, j])

   A    B    C  median_low
0  4  5.0  8.0         5.0
1  3  6.0  4.0         4.0
2  4  9.0  NaN         4.0
3  1  NaN  NaN         1.0
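
As a quick cross-check (a small sketch using the question's val1/val2/val3 column names), the vectorized result agrees with a row-wise statistics.median_low:

import numpy as np
import pandas as pd
from statistics import median_low

df = pd.DataFrame({'val1': [4, 3, 4, 1],
                   'val2': [5, 6, 9, np.nan],
                   'val3': [8, 4, np.nan, np.nan]})

v = np.sort(df.values, axis=1)                  # NaNs sort to the end of each row
n = np.count_nonzero(~np.isnan(v), axis=1)      # non-null values per row
j = (n - 1) // 2                                # index of the low median in each sorted row
vectorized = v[np.arange(len(v)), j]

looped = df.apply(lambda x: median_low(x.dropna()), axis=1)
assert np.allclose(vectorized, looped)          # both give [5, 4, 4, 1]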

Old Answer

First, you'll want to use pd.DataFrame.apply with the axis=1 option to apply the function to each row.

Second, median_low will include nulls in the calculation, which you don't want, so wrap it in a lambda that drops the nulls before calling median_low:


import pandas as pd
from statistics import median_low

df = pd.DataFrame([
    [4, 5, 8],
    [3, 6, 4],
    [4, 9],
    [1]
], columns=list('ABC'))

med = lambda x: median_low(x.dropna())

df.apply(med, 1)

0    5.0
1    4.0
2    4.0
3    1.0
dtype: float64

Upvotes: 2
