Reputation: 651
I am very new to Python, so I assume this is probably a basic question. I found a few solutions online but could not find the exact thing I was looking for. Currently I am searching for a way to find the "low median" over 3 columns of data. If only 2 values of the 3 columns are populated then I would like to take the lower value.
Here is what I have found so far
df['median']=np.nanmedian(df[['val1','val2','val3']], axis=1)
The above was not a workable solution as I did not see any information on an argument to take the low median when there are an even number of values. Additionally I found that there is a function that does what I am looking for
statistics.median_low()
However, I am unsure how to apply it across multiple columns without computing each row's median one at a time (i.e. with a loop or apply). Ideally, I would like a vectorized solution that calculates all the row medians at once. Thank you for your assistance.
Upvotes: 2
Views: 901
Reputation: 221614
A few optimizations are possible for three-column data. Sort each row: the NaNs, being sorted last, are pushed to the end of each row. We can then pick either the first or the second column of the sorted array, depending on whether the last column is NaN, and use slicing to get the desired median_low value for each row.
Here's assembling those into a vectorized solution -
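# Select the three value columns explicitly so that re-running this after
# df['median'] has been added does not sort the result column as well
a = df[['val1','val2','val3']].values
a_sorted = np.sort(a, axis=1)  # NaNs are placed at the end of each row
df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])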
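As a quick sanity check, here is a small sketch that runs the sort-and-select idea on a made-up frame and compares it against statistics.median_low applied row by row. The sample values and the helper name low_median_3 are assumptions for illustration, not part of the question's data.

```python
import numpy as np
import pandas as pd
from statistics import median_low

def low_median_3(df, cols):
    """Low median across exactly three columns, ignoring NaNs."""
    a = df[cols].to_numpy(dtype=float)
    a_sorted = np.sort(a, axis=1)  # NaNs are sorted to the end of each row
    # Third slot is NaN -> at most two values present -> take the smallest.
    # Otherwise all three are present -> take the middle one.
    return np.where(np.isnan(a_sorted[:, 2]), a_sorted[:, 0], a_sorted[:, 1])

cols = ['val1', 'val2', 'val3']
sample = pd.DataFrame({
    'val1': [4.0, 3.0, 9.0, np.nan],
    'val2': [5.0, 6.0, 4.0, 1.0],
    'val3': [8.0, 4.0, np.nan, np.nan],
})

result = low_median_3(sample, cols)

# Reference values computed one row at a time with the standard library.
expected = [median_low(row.dropna()) for _, row in sample[cols].iterrows()]
```

Here result and expected should hold the same four values, one per row.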
Runtime test
Approaches -
# Proposed in this post
def vectorized_app(df):
    a = df.values
    a_sorted = np.sort(a,1)
    df['median'] = np.where(np.isnan(a_sorted[:,2]), a_sorted[:,0], a_sorted[:,1])
    return df

# @piRSquared's new soln
def vectorized_app2(df):
    v = np.sort(df.values, axis=1)
    n = np.count_nonzero(~np.isnan(v), axis=1)
    j = (n - 1) // 2
    i = np.arange(len(v))
    return df.assign(median_low=v[i, j])

# @piRSquared's old soln
from statistics import median_low

def apply_app(df):
    med = lambda x: median_low(x.dropna())
    return df.apply(med, 1)
Timings -
In [433]: # Setup input dataframe and set one per row as NaN
...: np.random.seed(0)
...: a = np.random.randint(0,9,(10000,3)).astype(float)
...: idx = np.random.randint(0,3,a.shape[0])
...: a[np.arange(a.shape[0]), idx] = np.nan
...: df = pd.DataFrame(a)
...: df.columns = ['val1','val2','val3']
...:
In [435]: %timeit vectorized_app(df)
1000 loops, best of 3: 481 µs per loop
In [436]: %timeit vectorized_app2(df)
1000 loops, best of 3: 892 µs per loop
In [434]: %timeit apply_app(df)
1 loop, best of 3: 1.15 s per loop
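Timings only mean something if the approaches return the same values. The sketch below rebuilds a smaller version of the benchmark input and checks that the three approaches agree. The function names ending in _values and the 1,000-row size are assumptions chosen for illustration. Outside IPython, the standard-library timeit module can stand in for %timeit.

```python
import timeit
import numpy as np
import pandas as pd
from statistics import median_low

cols = ['val1', 'val2', 'val3']

# Smaller version of the benchmark input: exactly one NaN per row.
rng = np.random.RandomState(0)
a = rng.randint(0, 9, (1000, 3)).astype(float)
a[np.arange(len(a)), rng.randint(0, 3, len(a))] = np.nan
bench = pd.DataFrame(a, columns=cols)

def sort_where_values(df):
    # Sort, then choose column 0 or 1 depending on whether column 2 is NaN.
    s = np.sort(df[cols].to_numpy(), axis=1)
    return np.where(np.isnan(s[:, 2]), s[:, 0], s[:, 1])

def sort_index_values(df):
    # Sort, count non-nulls, index the low-median position per row.
    v = np.sort(df[cols].to_numpy(), axis=1)
    n = np.count_nonzero(~np.isnan(v), axis=1)
    return v[np.arange(len(v)), (n - 1) // 2]

def apply_values(df):
    # Row-by-row reference using statistics.median_low.
    return df[cols].apply(lambda x: median_low(x.dropna()), axis=1).to_numpy()

r1 = sort_where_values(bench)
r2 = sort_index_values(bench)
r3 = apply_values(bench)
all_agree = np.array_equal(r1, r2) and np.array_equal(r1, r3)

# Rough timing outside IPython; absolute numbers depend on the machine.
t_vectorized = timeit.timeit(lambda: sort_where_values(bench), number=20)
t_apply = timeit.timeit(lambda: apply_values(bench), number=1)
```

None of these helpers modifies the frame, so every call sees the same three columns.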
Upvotes: 2
Reputation: 294488
Answer
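This is a generalized solution that works for an array with any number of columns.
I sort each row, count the non-nulls, then determine where the median_low must be. Because np.sort places NaN last, the n non-null values of a row occupy its first n positions, so the low median sits at position (n - 1) // 2.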
v = np.sort(df.values, axis=1)
n = np.count_nonzero(~np.isnan(v), axis=1)
j = (n - 1) // 2
i = np.arange(len(v))
df.assign(median_low=v[i, j])
   A    B    C  median_low
0  4  5.0  8.0         5.0
1  3  6.0  4.0         4.0
2  4  9.0  NaN         4.0
3  1  NaN  NaN         1.0
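To illustrate that the (n - 1) // 2 index does not depend on the number of columns, here is a sketch that wraps the steps above in a function and runs it on a four-column frame. The name row_median_low and the sample values are assumptions for illustration. The result is checked against statistics.median_low applied per row.

```python
import numpy as np
import pandas as pd
from statistics import median_low

def row_median_low(df):
    """Low median of each row, ignoring NaNs (hypothetical helper name)."""
    v = np.sort(df.to_numpy(dtype=float), axis=1)  # NaNs end up last in each row
    n = np.count_nonzero(~np.isnan(v), axis=1)     # non-null count per row
    j = (n - 1) // 2                               # 0-based position of the low median
    return v[np.arange(len(v)), j]

wide = pd.DataFrame({
    'A': [4, 3, 4, 1],
    'B': [5, 6, 9, np.nan],
    'C': [8, 4, np.nan, np.nan],
    'D': [6, 7, 10, np.nan],
})

result = row_median_low(wide)

# Reference values computed one row at a time with the standard library.
expected = [median_low(row.dropna()) for _, row in wide.iterrows()]
```

A row with no values at all has n = 0, so j = -1 indexes the last slot of the sorted row, which is NaN, and the result for that row is NaN.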
Old Answer
First, you'll want to use pd.DataFrame.apply with the axis=1 option to apply the function to each row.
Second, median_low does not skip nulls. You don't want them counted, so make a lambda that drops the nulls and then calls median_low.
import pandas as pd
from statistics import median_low

df = pd.DataFrame([
    [4, 5, 8],
    [3, 6, 4],
    [4, 9],
    [1]
], columns=list('ABC'))

# Drop the nulls in each row before taking the low median
med = lambda x: median_low(x.dropna())
df.apply(med, axis=1)
0    5.0
1    4.0
2    4.0
3    1.0
dtype: float64
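The reason for dropna() can be seen on a single row. median_low sorts its input, and since every comparison with NaN is False, a NaN is not reliably moved out of the way, so it can be counted as a data point. A minimal sketch with made-up values:

```python
import math
from statistics import median_low

row = [4.0, 9.0, float('nan')]

# NaN is counted as a third data point here, so the result is not
# guaranteed to be the low median of the two real values.
with_nan = median_low(row)

# Dropping NaN first leaves two values, and the lower one is returned.
clean = [x for x in row if not math.isnan(x)]
without_nan = median_low(clean)  # 4.0
```

The same filtering is what x.dropna() does for each row inside the lambda.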
Upvotes: 2