Artem Kupriyanov

Reputation: 165

NumPy: calculate cumulative median

I have a sample of size n.

For each i with 1 <= i <= n, I want to compute the median of sample[:i] in NumPy. For example, I computed the cumulative mean for each i like this:

cummean = np.cumsum(sample) / np.arange(1, n + 1)

Can I do something similar for the median, without loops or comprehensions?
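For concreteness, here is a minimal loop-based baseline (toy data assumed) computing both quantities; the median loop is what a vectorized answer would replace:

```python
import numpy as np

sample = np.array([1, 3, 4, 2, 5])
n = sample.size

# cumulative mean, as in the question
cummean = np.cumsum(sample) / np.arange(1, n + 1)

# the loop-based cumulative median this question asks to vectorize
cummedian = np.array([np.median(sample[:i]) for i in range(1, n + 1)])
```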

Upvotes: 9

Views: 4251

Answers (5)

yupbank

Reputation: 265

There is an approximate solution: if you treat the array arr of values as a probability mass function, you can use np.cumsum(arr) to get the cumulative distribution function. By definition, the median is the point that splits the probability in half, which gives you an approximation:

arr[np.searchsorted(np.cumsum(arr), np.cumsum(arr)/2)]

Upvotes: 1
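A small demonstration (made-up input) of what this expression computes compared to the exact prefix medians; the results track each other but are not equal:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
cdf = np.cumsum(arr)                      # treat arr as a PMF; this is its CDF
approx = arr[np.searchsorted(cdf, cdf / 2)]

# exact prefix medians, for comparison
exact = np.array([np.median(arr[:i + 1]) for i in range(len(arr))])
# approx -> [1, 2, 2, 3]; exact -> [1, 1.5, 2, 2.5]
```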

hpaulj

Reputation: 231510

Knowing that Python has a heapq module that lets you keep a running 'minimum' for an iterable, I did a search on heapq and median, and found various items on streaming medians. This one:

http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/

has a class streamMedian that maintains two heaps (via heapq), one with the bottom half of the values and the other with the top half. The median is either the top of one heap or the mean of the tops of both. The class has an insert method and a getMedian method; most of the work is in insert.

I copied that into an IPython session and defined:

def cummedian_stream(b):
    S=streamMedian()
    ret = []
    for item in b:
        S.insert(item)
        ret.append(S.getMedian())
    return np.array(ret)
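For completeness, here is a minimal two-heap sketch of such a streamMedian class, reconstructed from the description above (not the linked code verbatim):

```python
import heapq

class streamMedian:
    """Running median via two heaps: a max-heap (stored as negated values)
    for the lower half and a min-heap for the upper half."""
    def __init__(self):
        self.low = []   # max-heap via negation: largest of the lower half on top
        self.high = []  # min-heap: smallest of the upper half on top

    def insert(self, x):
        if not self.low or x <= -self.low[0]:
            heapq.heappush(self.low, -x)
        else:
            heapq.heappush(self.high, x)
        # rebalance so the heap sizes differ by at most one (low may be larger)
        if len(self.low) > len(self.high) + 1:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        elif len(self.high) > len(self.low):
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def getMedian(self):
        if len(self.low) > len(self.high):
            return -self.low[0]
        return (-self.low[0] + self.high[0]) / 2.0
```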

Testing:

In [155]: a = np.random.randint(0,100,(5000))
In [156]: amed = cummedian_stream(a)
In [157]: np.allclose(cummedian_sorted(a), amed)
Out[157]: True
In [158]: timeit cummedian_sorted(a)
1 loop, best of 3: 781 ms per loop
In [159]: timeit cummedian_stream(a)
10 loops, best of 3: 39.6 ms per loop

The heapq stream approach is way faster.


The list comprehension that @Uriel gave is relatively slow. But if I substitute np.median for statistics.median it is faster than @Divakar's sorted solution:

def fastloop(a):
    return np.array([np.median(a[:i+1]) for i in range(len(a))])

In [161]: timeit fastloop(a)
1 loop, best of 3: 360 ms per loop

And @Paul Panzer's partition approach is also good, but still slow compared to the streaming class.

In [165]: timeit cummedian_partition(a)
1 loop, best of 3: 391 ms per loop

(I could copy the streamMedian class to this answer if needed).

Upvotes: 7

Paul Panzer

Reputation: 53079

Is there room for a late entry?

def cummedian_partition(a):
    n = len(a)
    assert n%4 == 0 # for simplicity
    mn = a.min() - 1
    mx = a.max() + 1
    h = n//2
    N = n + h//2
    work = np.empty((h, N), a.dtype)
    work[:, :n] = a
    work[:, n] = 2*mn - a[0]
    i, j = np.tril_indices(h, -1)
    work[i, n-1-j] = (2*mn - a[1:h+1])[j]
    k, l = np.ogrid[:h, :h//2 - 1]
    work[:, n+1:] = np.where(k > 2*l+1, mx, 2 * mn - mx)
    out = np.partition(work, (N-n//2-1, N-n//2, h//2-1, h//2), axis=-1)
    out = np.r_[2*mn-out[:, h//2: h//2-2:-1], out[::-1, N-n//2-1:N-n//2+1]]
    out[::2, 0] = out[::2, 1]
    return np.mean(out, axis=-1)

The algorithm uses partition, which has linear complexity. Some gymnastics are required because np.partition does not support per-row split points. The combined time and memory complexity are quadratic.
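The per-row limitation is easier to see in a looped form. Here is a per-prefix sketch of the same partition idea (hypothetical helper for illustration; linear per call, so O(n²) when looped over all prefixes):

```python
import numpy as np

def prefix_median_partition(a, i):
    """Median of a[:i+1] via np.partition instead of a full sort."""
    p = a[:i + 1].copy()
    k = p.size
    if k % 2:                                  # odd length: single middle element
        return float(np.partition(p, k // 2)[k // 2])
    q = np.partition(p, (k // 2 - 1, k // 2))  # even: mean of the two middles
    return (q[k // 2 - 1] + q[k // 2]) / 2.0
```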

Timings compared to current best:

for j in (100, 1000, 5000):
    a = np.random.randint(0, 100, (j,))
    print('size', j)
    print('correct', np.allclose(cummedian_partition(a), cummedian_sorted(a)))
    print('Divakar', timeit(lambda: cummedian_sorted(a), number=10))
    print('PP', timeit(lambda: cummedian_partition(a), number=10))

#  size 100
#  correct True
#  Divakar 0.0022412699763663113
#  PP 0.002393342030700296
#  size 1000
#  correct True
#  Divakar 0.20881508802995086
#  PP 0.10222102201078087
#  size 5000
#  correct True
#  Divakar 6.158387024013791
#  PP 3.437395485001616

Upvotes: 1

Divakar

Reputation: 221614

Here's an approach that replicates the elements along rows to give us a 2D array. Then we fill the upper triangular region with a big number, so that when we later sort each row, only the elements up to the diagonal matter; this simulates the cumulative windows. Then, following the definition of the median (the middle element, or the mean of the two middle ones for an even number of elements), we pick out the element at (0,0) for the first row, the mean of (1,0) and (1,1) for the second row, (2,1) for the third row, the mean of (3,1) and (3,2) for the fourth row, and so on. Extracting those elements from the sorted array gives us the median values.

Thus, the implementation would be -

def cummedian_sorted(a):
    n = a.size
    maxn = a.max()+1
    a_tiled_sorted = np.tile(a,n).reshape(-1,n)
    mask = np.triu(np.ones((n,n),dtype=bool),1)

    a_tiled_sorted[mask] = maxn
    a_tiled_sorted.sort(1)

    all_rows = a_tiled_sorted[np.arange(n), np.arange(n)//2].astype(float)
    idx = np.arange(1,n,2)
    even_rows = a_tiled_sorted[idx, np.arange(1,1+(n//2))]
    all_rows[idx] += even_rows
    all_rows[1::2] /= 2.0
    return all_rows
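A tiny worked example (hypothetical 4-element input) of the upper-triangle masking step, showing how each sorted row becomes a cumulative window:

```python
import numpy as np

a = np.array([1, 3, 4, 2])
n = a.size
big = a.max() + 1                                   # sentinel above every value
t = np.tile(a, n).reshape(-1, n)                    # one copy of a per row
t[np.triu(np.ones((n, n), dtype=bool), 1)] = big    # hide "future" elements
t.sort(1)
# row i is now sorted(a[:i+1]) padded with sentinels:
# [[1 5 5 5]
#  [1 3 5 5]
#  [1 3 4 5]
#  [1 2 3 4]]
```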

Runtime test

Approaches -

# Loopy solution from @Uriel's soln (needs: from statistics import median)
def cummedian_loopy(a):
    return [median(a[:i]) for i in range(1, len(a) + 1)]

# Nan-fill based solution from @Nickil Maveli's soln   
def cummedian_nanfill(arr):
    a = np.tril(arr).astype(float)
    a[np.triu_indices(a.shape[0], k=1)] = np.nan
    return np.nanmedian(a, axis=1)

Timings -

Set #1 :

In [43]: a = np.random.randint(0,100,(100))

In [44]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))
    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))
    ...: 
True
True

In [45]: %timeit cummedian_loopy(a)
    ...: %timeit cummedian_nanfill(a)
    ...: %timeit cummedian_sorted(a)
    ...: 
1000 loops, best of 3: 856 µs per loop
1000 loops, best of 3: 778 µs per loop
10000 loops, best of 3: 200 µs per loop

Set #2 :

In [46]: a = np.random.randint(0,100,(1000))

In [47]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))
    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))
    ...: 
True
True

In [48]: %timeit cummedian_loopy(a)
    ...: %timeit cummedian_nanfill(a)
    ...: %timeit cummedian_sorted(a)
    ...: 
10 loops, best of 3: 118 ms per loop
10 loops, best of 3: 47.6 ms per loop
100 loops, best of 3: 18.8 ms per loop

Set #3 :

In [49]: a = np.random.randint(0,100,(5000))

In [50]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))
    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))

True
True

In [54]: %timeit cummedian_loopy(a)
    ...: %timeit cummedian_nanfill(a)
    ...: %timeit cummedian_sorted(a)
    ...: 
1 loops, best of 3: 3.36 s per loop
1 loops, best of 3: 583 ms per loop
1 loops, best of 3: 521 ms per loop

Upvotes: 5

Uriel

Reputation: 16194

Use statistics.median with a cumulative list comprehension (note that odd indices contain the medians of even-length prefixes, where the median is the average of the two middle elements, so it usually yields a decimal rather than an integer):

>>> from statistics import median
>>> arr = [1, 3, 4, 2, 5, 3, 6]
>>> cum_median = [median(arr[:i+1]) for i in range(len(arr))]
>>> cum_median
[1, 2.0, 3, 2.5, 3, 3.0, 3]

Upvotes: 2
