CuriousMind
CuriousMind

Reputation: 15798

Numpy vectorization algorithms to sum up numbers with the same time stamps

I have two arrays P and T. P[i] is a number, whose time stamp is T[i]; There might be duplicated time stamps.

I want to produce another two arrays Q and U, where Q[i] has time stamp U[i], and Q[i] is the sum of all elements in P that have time stamp U[i];

For example, for

P = [1, 2, 3, 4, 5] T = [0, 0, 1, 1, 1]

I will produce

Q = [3, 12] U = [0, 1];

Is there a fast way of doing this in numpy, that hopefully vectorizes it?

Upvotes: 3

Views: 876

Answers (3)

unutbu
unutbu

Reputation: 879133

Using numpy 1.4 or better:

import numpy as np

P = np.array([1, 2, 3, 4, 5]) 
T = np.array([0, 0, 1, 1, 1])

U,inverse = np.unique(T,return_inverse=True)
Q = np.bincount(inverse,weights=P)
print (Q, U)
# (array([  3.,  12.]), array([0, 1]))

Please note that this is not the fastest solution. I tested the speed this way:

import numpy as np

N = 1000
P = np.repeat(np.array([1, 2, 3, 4, 5]),N)
T = np.repeat(np.array([0, 0, 1, 1, 1]),N)

def using_bincount():
    U,inverse = np.unique(T,return_inverse=True)
    Q = np.bincount(inverse,weights=P)
    return Q,U
    # (array([  3.,  12.]), array([0, 1]))

def using_lc():
    U = list(set(T))
    Q = [sum([p for (p,t) in zip(P,T) if t == u]) for u in U]
    return Q,U

def using_slice():
    U = np.unique(T)
    Q = np.array([P[T == u].sum() for u in U])
    return Q,U

For small arrays, wim's solution is faster (N=1):

% python -mtimeit -s'import test' 'test.using_lc()'
100000 loops, best of 3: 18.4 usec per loop
% python -mtimeit -s'import test' 'test.using_slice()'
10000 loops, best of 3: 66.8 usec per loop
% python -mtimeit -s'import test' 'test.using_bincount()'
10000 loops, best of 3: 52.8 usec per loop

For large arrays, joris's solution is faster (N=1000):

% python -mtimeit -s'import test' 'test.using_lc()'
100 loops, best of 3: 9.93 msec per loop
% python -mtimeit -s'import test' 'test.using_slice()'
1000 loops, best of 3: 390 usec per loop
% python -mtimeit -s'import test' 'test.using_bincount()'
1000 loops, best of 3: 846 usec per loop

I doubt it matters in this case, but benchmarks can change depending on version of numpy, python, OS, or hardware. It would not hurt to repeat these benchmarks on your machine.

Upvotes: 4

joris
joris

Reputation: 139142

import numpy as np
P = np.array([1, 2, 3, 4, 5]) 
T = np.array([0, 0, 1, 1, 1])

U = np.unique(T)
Q = np.array([P[T == u].sum() for u in U])

gives

In [17]: print Q, U
[3 12] [0 1]

Not really vectorized, but faster than the solution with lists.

If you want more powerful of this kind of group-by functions, maybe you can take a look at pandas.

Upvotes: 2

wim
wim

Reputation: 362507

>>> P = [1, 2, 3, 4, 5]; T = [0, 0, 1, 1, 1]
>>> U = list(set(T))
>>> Q = [sum([p for (p,t) in zip(P,T) if t == u]) for u in U]
>>> print Q, U
[3, 12] [0, 1]

Upvotes: 1

Related Questions