Reputation: 35
I would like to compute the median row by row in a dataframe of more than 500,000 rows. For the moment I'm using np.median, since numpy is optimized to run on a single core. It's still very slow, and I'd like to find a way to parallelize the calculation.
Specifically, I have N tables of size 13 x 500,000, and for each table I want to add the columns Q1, Q3 and median, so that for each row the median column contains the median of that row. So I have to calculate N * 500,000 median values.
I tried with numexpr, but it doesn't seem possible.
EDIT: In fact I also need Q1 and Q3, so I can't use the statistics module, which doesn't provide a way to calculate quartiles. Here is how I calculate the median for the moment:
q = np.transpose(np.percentile(data[row_array], [25,50,75], axis = 1))
data['Q1_' + family] = q[:,0]
data['MEDIAN_' + family] = q[:,1]
data['Q3_' + family] = q[:,2]
EDIT 2: I solved my problem by using the median of medians algorithm, as proposed below.
Upvotes: 2
Views: 3705
Reputation: 2028
If a (close) approximation of the median is OK for your purposes, you should consider computing a median of medians (MoM), which is a divide-and-conquer strategy that can be executed in parallel. In principle, MoM has O(n) complexity for serial execution, approaching O(1) for parallel execution on massively parallel systems.
See this Wiki entry for a description and pseudo-code. See also this question on Stack Overflow and discussion of the code, and this ArXiv paper for a GPU implementation.
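To make the idea concrete, here is a minimal sketch of an approximate median of medians in NumPy. The function name approx_median_of_medians and the group_size parameter are my own illustrative choices, not taken from the linked references. The result is an approximation, not the exact median in general.

```python
import numpy as np

def approx_median_of_medians(values, group_size=5):
    """Approximate the median by repeatedly taking medians of small groups.

    Hypothetical helper for illustration; the result is only an
    approximation of the true median for inputs longer than group_size.
    """
    arr = np.asarray(values, dtype=float)
    while arr.size > group_size:
        # Number of elements that fit into complete groups.
        n_full = arr.size // group_size * group_size
        # Each group is independent of the others, which is what makes
        # this step a candidate for parallel execution.
        medians = np.median(arr[:n_full].reshape(-1, group_size), axis=1)
        if n_full < arr.size:
            # The leftover elements form one last, smaller group.
            medians = np.append(medians, np.median(arr[n_full:]))
        arr = medians
    # At most group_size values remain; take their exact median.
    return float(np.median(arr))
```

Each pass shrinks the data by roughly a factor of group_size, and the per-group medians within a pass do not depend on each other, so they could be distributed across workers.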
Upvotes: 3
Reputation: 6483
From what I understood, you want to compute the quantiles row by row.
You can simply transpose your dataframe and then apply pandas.DataFrame.quantile
Not sure this is optimal, though.
q = data.T.quantile([0.25, 0.50, 0.75], axis=0)
If you have IPython active, you can use the line magic %time in front of the statement, on the same line, to check the run time.
%time q = data.T.quantile([0.25, 0.50, 0.75], axis=0)
When I ran %time on a line by itself above the statement, it returned: Wall time: 0 ns to me, which reflects timing an empty statement rather than the quantile call.
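As a rough illustration of the row-wise quantile idea, the sketch below uses a small synthetic frame as a stand-in for the real table (the data, its size, and the column names Q1, MEDIAN, Q3 are assumptions). Passing axis=1 to DataFrame.quantile computes the quantiles across the columns of each row, which avoids the explicit transpose.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one table: 1,000 rows of 13 values (assumed shape).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 13)))

# axis=1 computes the quantiles across the 13 columns of each row.
# The result has one row per quantile level and one column per input row.
q = data.quantile([0.25, 0.50, 0.75], axis=1)

# Transpose so that each input row gets its three quantiles side by side.
result = q.T
result.columns = ["Q1", "MEDIAN", "Q3"]
```

With the default linear interpolation, this should give the same values as np.percentile(data, [25, 50, 75], axis=1), so it is mainly a convenience rather than a guaranteed speed-up.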
Upvotes: 0
Reputation: 9536
Courtesy of @dahhiya_boy
You can use median() from the statistics module:
import statistics
statistics.median(items)
You can calculate Q1 by taking the median of the values between min() and median() (the lower half of the sorted data), and you can calculate Q3 by taking the median of the values between median() and max() (the upper half). If you find this messy, just define a quartile_median() function that returns Q1, Q2, Q3.
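One possible shape for such a quartile_median() function is sketched below. It assumes the convention that the middle element is left out of both halves when the number of items is odd; other quartile conventions exist, and np.percentile's default linear interpolation can give different values.

```python
import statistics

def quartile_median(items):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Illustrative sketch: needs at least two items, otherwise the lower
    half is empty and statistics.median raises StatisticsError.
    """
    ordered = sorted(items)
    n = len(ordered)
    half = n // 2
    # Lower half: everything below the middle position.
    lower = ordered[:half]
    # Upper half: skip the middle element when n is odd.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = statistics.median(lower)
    q2 = statistics.median(ordered)
    q3 = statistics.median(upper)
    return q1, q2, q3
```

For example, quartile_median([1, 2, 3, 4, 5, 6, 7]) returns (2, 4, 6). On Python 3.8 and later, statistics.quantiles(items, n=4) is also available as a built-in alternative.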
Upvotes: 1