Justin Klevs

Reputation: 651

Efficiently finding the low median over multiple columns

I am very new to Python, so this is probably a basic question. I found a few solutions online but could not find exactly what I was looking for. I am trying to find the "low median" across 3 columns of data: if only 2 of the 3 columns are populated, I would like to take the lower of the two values.

Here is what I have found so far:

df['median']=np.nanmedian(df[['val1','val2','val3']], axis=1)

The above was not a workable solution, as I could not find an argument to take the low median when there is an even number of values. I did, however, find a function that does what I am looking for:

statistics.median_low()
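
For reference, this is the "low median" behaviour I am after, illustrated with the standard library:

from statistics import median, median_low

median([3, 6])         # 4.5 -- the ordinary median averages the two middle values
median_low([3, 6])     # 3   -- the low median returns the lower of the two
median_low([4, 5, 8])  # 5   -- for an odd count it is simply the middle value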

However, I am unsure how to apply it over multiple columns without calculating each median row by row (i.e. a loop or an apply). Ideally, I would like a vectorized solution that calculates all of the medians at once. Thank you for your assistance.

Upvotes: 2

Views: 901

Answers (2)

Divakar

Reputation: 221614

A few optimizations are possible for the three-column case: sort each row, which pushes any NaNs to the end of the row, and then pick either the first or the second sorted column depending on whether a NaN is present. Slicing then does the choosing and gives the desired median_low value for each row.

Here's that assembled into a vectorized solution -

a = df.values                      # the question's frame holds just the three value columns
a_sorted = np.sort(a, 1)           # sort each row; NaNs are pushed to the last column
# last sorted value NaN -> only two values present -> take the lower (first) one,
# otherwise take the middle (second) value
df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
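
Note that a = df.values assumes the dataframe holds only the three value columns. If it carries other columns as well, slice the value columns out first (using the question's val1/val2/val3 names) -

a = df[['val1','val2','val3']].values  # restrict to the three value columns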

Runtime test

Approaches -

# Proposed in this post
def vectorized_app(df):
    a = df.values
    a_sorted = np.sort(a,1)
    df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
    return df

# @piRSquared's new soln
def vectorized_app2(df):
    v = np.sort(df.values, axis=1)
    n = np.count_nonzero(~np.isnan(v), axis=1)
    j = (n - 1) // 2
    i = np.arange(len(v))
    return df.assign(median_low=v[i, j])

# @piRSquared's old soln
from statistics import median_low
def apply_app(df):
    med = lambda x: median_low(x.dropna())
    return df.apply(med, 1)

Timings -

In [433]: # Setup input dataframe and set one per row as NaN
     ...: np.random.seed(0)
     ...: a = np.random.randint(0,9,(10000,3)).astype(float)
     ...: idx = np.random.randint(0,3,a.shape[0])
     ...: a[np.arange(a.shape[0]), idx] = np.nan
     ...: df = pd.DataFrame(a)
     ...: df.columns = ['val1','val2','val3']
     ...: 

In [435]: %timeit vectorized_app(df)
1000 loops, best of 3: 481 µs per loop

In [436]: %timeit vectorized_app2(df)
1000 loops, best of 3: 892 µs per loop

In [434]: %timeit apply_app(df)
1 loop, best of 3: 1.15 s per loop

Upvotes: 2

piRSquared

Reputation: 294488

Answer

This is a generalized solution that works for any number of columns.

I sort each row, count the non-null values, and use that count to work out where the median_low sits in each sorted row.

v = np.sort(df.values, axis=1)                  # NaNs end up at the right of each row
n = np.count_nonzero(~np.isnan(v), axis=1)      # number of non-null values per row
j = (n - 1) // 2                                # position of the low median in each sorted row
i = np.arange(len(v))

df.assign(median_low=v[i, j])

   A    B    C  median_low
0  4  5.0  8.0         5.0
1  3  6.0  4.0         4.0
2  4  9.0  NaN         4.0
3  1  NaN  NaN         1.0
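
As a quick cross-check (a small sketch using the question's val1/val2/val3 column names), the vectorized result agrees with a row-wise statistics.median_low:

import numpy as np
import pandas as pd
from statistics import median_low

df = pd.DataFrame({'val1': [4, 3, 4, 1],
                   'val2': [5, 6, 9, np.nan],
                   'val3': [8, 4, np.nan, np.nan]})

v = np.sort(df.values, axis=1)                  # NaNs sort to the end of each row
n = np.count_nonzero(~np.isnan(v), axis=1)      # non-null values per row
j = (n - 1) // 2                                # index of the low median in each sorted row
vectorized = v[np.arange(len(v)), j]

looped = df.apply(lambda x: median_low(x.dropna()), axis=1)
assert np.allclose(vectorized, looped)          # both give [5, 4, 4, 1]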

Old Answer

First, you'll want to use pd.DataFrame.apply with the axis=1 option to apply the function to each row.

Second, median_low will include nulls in the calculation, which you don't want, so wrap it in a lambda that drops the nulls before calling median_low:


import pandas as pd
from statistics import median_low

df = pd.DataFrame([
    [4, 5, 8],
    [3, 6, 4],
    [4, 9],
    [1]
], columns=list('ABC'))

med = lambda x: median_low(x.dropna())

df.apply(med, 1)

0    5.0
1    4.0
2    4.0
3    1.0
dtype: float64

Upvotes: 2
