Pierre Carceller

Reputation: 35

How to make the calculation of the median faster in Python

I would like to calculate the median row by row in a dataframe of more than 500,000 rows. For the moment I'm using np.median because NumPy is optimized for running on a single core. It's still very slow, and I'd like to find a way to parallelize the calculation.

Specifically, I have N tables of size 13 x 500,000 and for each table I want to add the columns Q1, Q3 and median so that for each row the median column contains the median of the row. So I have to calculate N * 500,000 median values.

I tried with numexpr but it doesn't seem possible.

EDIT: In fact I also need Q1 and Q3, so I can't use the statistics module, which doesn't let me calculate quartiles. Here is how I calculate the median for the moment:

    q = np.transpose(np.percentile(data[row_array], [25,50,75], axis = 1))
    data['Q1_' + family] = q[:,0]
    data['MEDIAN_' + family] = q[:,1]
    data['Q3_' + family] = q[:,2]

EDIT 2: I solved my problem by using the median of medians algorithm, as proposed below.

Upvotes: 2

Views: 3705

Answers (3)

MPA

Reputation: 2028

If a (close) approximation of the median is OK for your purposes, you should consider computing a median of medians (MoM), a divide-and-conquer strategy that can be executed in parallel. In principle, MoM has O(n) complexity for serial execution, approaching O(1) for parallel execution on massively parallel systems.

See this Wiki entry for a description and pseudo-code. See also this question on Stack Overflow and discussion of the code, and this ArXiv paper for a GPU implementation.
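As a rough illustration of the idea, here is a minimal sketch of the approximation: split the values into small groups, take the median of each group, and repeat on the group medians. The function name `approx_median_of_medians` and the `group_size` parameter are made up for this example, and the result is only an estimate of the true median.

```python
import numpy as np


def approx_median_of_medians(values, group_size=5):
    """Approximate the median of a 1-D array by repeatedly taking group medians."""
    values = np.asarray(values, dtype=float)
    while values.size > group_size:
        # Complete groups are reshaped into rows so their medians are
        # computed in one vectorized call.
        n_full = values.size // group_size
        full = values[: n_full * group_size].reshape(n_full, group_size)
        medians = np.median(full, axis=1)
        # Leftover values form one smaller final group.
        rest = values[n_full * group_size:]
        if rest.size:
            medians = np.append(medians, np.median(rest))
        values = medians
    return float(np.median(values))


print(approx_median_of_medians(np.arange(25)))  # exact here: 12.0
print(approx_median_of_medians(np.arange(13)))  # 7.0, while the true median is 6.0
```

The group medians within one pass do not depend on each other, which is what makes the scheme a candidate for parallel execution. The second call shows that the answer can be off by a little.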

Upvotes: 3

CAPSLOCK

Reputation: 6483

From what I understood, you want to compute the quantiles row by row. You can simply transpose your dataframe and then apply pandas.DataFrame.quantile. I'm not sure this is optimal, though.

    q = data.quantile([0.25, 0.50, 0.75], axis=0)

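If you have IPython active, you can put the %time line magic in front of the statement to check its run time:

    %time q = data.quantile([0.25, 0.50, 0.75], axis=0)

The magic has to be on the same line as the statement. With %time alone on the line before it, nothing is timed, and what this returns to me is Wall time: 0 ns.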
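For reference, a minimal sketch of the row-wise variant that skips the transpose by passing axis=1. The table here is random stand-in data, and the column names (`v0` to `v12`, `Q1`, `MEDIAN`, `Q3`) are assumptions for the example, not the asker's real names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one table: 1,000 rows, 13 value columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(1000, 13)),
                    columns=[f"v{i}" for i in range(13)])
value_cols = list(data.columns)

# axis=1 computes the quantiles across the columns, i.e. one set per row.
# The result has one row per quantile, so transpose it to line up with data.
q = data[value_cols].quantile([0.25, 0.50, 0.75], axis=1).T

data["Q1"] = q[0.25]
data["MEDIAN"] = q[0.50]
data["Q3"] = q[0.75]
```

Whether this beats np.percentile on the underlying array is something to measure on your own data.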

Upvotes: 0

Alec

Reputation: 9536

Courtesy of @dahhiya_boy

You can use median() from the statistics module:

    import statistics

    statistics.median(items)

You can calculate Q1 by taking the median of the values between min() and median(), and you can calculate Q3 by taking the median of the values between median() and max(). If you find this messy, just define a quartile_median() function that returns Q1, Q2 and Q3.
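One possible shape for such a quartile_median() function is sketched below. It assumes the "median of each half" convention, leaving the middle value out of both halves when the length is odd. This is only one of several quartile definitions, so its results can differ from np.percentile, which interpolates linearly by default.

```python
import statistics


def quartile_median(items):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(items)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]       # values below the median position
    upper = ordered[n - half:]   # values above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


print(quartile_median(range(1, 14)))  # (3.5, 7, 10.5)
```

On Python 3.8 and later, statistics.quantiles(items, n=4) is a built-in alternative that returns the three cut points directly, again with its own interpolation rule.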

Upvotes: 1
