maxpower
maxpower

Reputation: 47

how does numpy find the median in an array/list?

I read, that numpy uses introselect to find the median in an array/ list (https://www.researchgate.net/publication/303755458_Fast_Deterministic_Selection) [page 2; last 5 lines]. But I couldn't find any hints for that in the numpy source code: https://github.com/numpy/numpy/blob/v1.19.0/numpy/lib/function_base.py#L3438-L3525

Does anyone know where I could find the numpy implementation of introselect? Or if numpy doesn't use introselect, what kind of algorithm do the use to find the median?

Many thanks in advance :)

Upvotes: 0

Views: 292

Answers (1)

Surt
Surt

Reputation: 16129

In line 3528 seems to be the main median function. If we cut out all the multidimensional and nan stuff we get something like

def _median(a, axis=None, out=None, overwrite_input=False):
    # can't be reasonably be implemented in terms of percentile as we have to
    # call mean to not break astropy

    # Set the partition indexes
    sz = a.shape
    if sz % 2 == 0:
        szh = sz // 2
        kth = [szh - 1, szh]
    else:
        kth = [(sz - 1) // 2]

    part = partition(a, kth, axis=None)

    return mean(part[indexer], axis=None, out=out)

So partition is doing all the work and comes from

from numpy.core.fromnumeric import (
    ravel, nonzero, partition, mean, any, sum
    )

If we go to the numpy code we get to the following C code.

NPY_SELECTKIND sortkind = NPY_INTROSELECT;

and

val = PyArray_Partition(self, ktharray, axis, sortkind);

Which implemented here and uses

mid = ll + median_of_median5_@suff@(v + ll, hh - ll, NULL, NULL);

So it is introselect.

Once twice the recursion depth is reached the algorithm change to use meadian-of-median5 until the partition is less than 5.

Upvotes: 1

Related Questions