Shannon

Reputation: 458

NumPy maxima of groups defined by a label array

I have two arrays, one is a list of values and one is a list of IDs corresponding to each value. Some IDs have multiple values. I want to create a new array that contains the maximum value recorded for each id, which will have a length equal to the number of unique ids.

Example using a for loop:

import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

uniq_ids = np.unique(ids)
maximums = np.ones_like(uniq_ids) * np.nan

for i, id in enumerate(uniq_ids):
    maximums[i] = np.max(values[np.where(ids == id)])

print(uniq_ids)
print(maximums)
[0 1 3 5 6]
[5. 3. 6. 8. 8.]

Is it possible to vectorize this so it runs fast? I'm imagining a one-liner that can create the "maximums" array using only NumPy functions, but I haven't been able to come up with anything that works.

Upvotes: 5

Views: 507

Answers (6)

mathfux

Reputation: 5949

np.lexsort can sort by multiple keys, but that isn't needed here. You can sort by ids alone and then take the maximum of each resulting group with numpy.maximum.reduceat:

def mathfux(values, ids, return_groups=False):
    argidx = np.argsort(ids)  # ~70% of the time
    ids_sort, values_sort = ids[argidx], values[argidx]  # ~4% of the time
    div_points = np.r_[0, np.flatnonzero(np.diff(ids_sort)) + 1]  # ~11% of the time (mostly np.flatnonzero)
    if return_groups:
        return ids_sort[div_points], np.maximum.reduceat(values_sort, div_points)
    else:
        return np.maximum.reduceat(values_sort, div_points)

mathfux(values, ids, return_groups=True)
>>> (array([0, 1, 3, 5, 6]), array([5, 3, 6, 8, 8]))
mathfux(values, ids)
>>> array([5, 3, 6, 8, 8])

Parts of NumPy code like this can often be optimised further with numba. Note, however, that np.argsort is the bottleneck in most groupby problems, it can't be replaced by any other method, and it is unlikely to get much faster in numba or NumPy any time soon. So you are already close to optimal performance here and there is not much room for further optimisation.
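
As a rough illustration of the numba remark (my own sketch, not part of the original answer; it assumes numba is installed and reuses values and ids from the question), the reduction step after the argsort can be jitted, while the argsort itself remains the dominant cost:

import numpy as np
from numba import njit

@njit
def group_max_sorted(values_sort, ids_sort):
    # values_sort and ids_sort must already be sorted by id (and be non-empty)
    n_groups = 1
    for i in range(1, ids_sort.size):
        if ids_sort[i] != ids_sort[i - 1]:
            n_groups += 1
    out = np.empty(n_groups, values_sort.dtype)
    g = 0
    out[0] = values_sort[0]
    for i in range(1, ids_sort.size):
        if ids_sort[i] != ids_sort[i - 1]:
            g += 1
            out[g] = values_sort[i]
        elif values_sort[i] > out[g]:
            out[g] = values_sort[i]
    return out

argidx = np.argsort(ids)
group_max_sorted(values[argidx], ids[argidx])  # expected: array([5, 3, 6, 8, 8])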

Upvotes: 4

Mad Physicist

Reputation: 114320

Use np.lexsort to sort both arrays simultaneously:

idx = np.lexsort([values, ids])

The positions (in sorted order) of the last element of each ID group are given by

last = np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]

You can use this to get the maxima of each group:

values[idx[last]]

This is the same as values[idx][last], but faster because you only need to extract len(last) elements this way, instead of rearranging the whole array and then extracting.
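
Putting the pieces together on the sample arrays from the question (an end-to-end check of the recipe above; the printed results are what I'd expect):

import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

idx = np.lexsort([values, ids])   # sort by ids, break ties by values
last = np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]

print(ids[idx[last]])     # [0 1 3 5 6]  -> the unique IDs
print(values[idx[last]])  # [5 3 6 8 8]  -> the per-ID maxima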

Keep in mind that np.unique essentially does the same sort-and-flatnonzero work internally when you pass return_index=True.
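
For illustration (my addition, not part of the original answer), return_index exposes the first-occurrence indices directly:

np.unique(ids, return_index=True)
(array([0, 1, 3, 5, 6]), array([0, 1, 2, 6, 7]))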

Here's a collection of all the solutions so far, along with a benchmark. I also include a modified version of @bb1's pandas recipe below.

import numpy as np
import pandas as pd

def Shannon(values, ids):
    uniq_ids = np.unique(ids)
    maxima = np.ones_like(uniq_ids) * np.nan
    for i, id in enumerate(uniq_ids):
        maxima[i] = np.max(values[np.where(ids == id)])
    return maxima

def richardec(values, ids):
    return [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]

def MadPhysicist(values, ids):
    idx = np.lexsort([values, ids])
    return values[idx[np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]]]

def PeptideWitch(values, ids):
    return np.vectorize(lambda id: np.max(values[np.where(ids == id)]))(np.unique(ids))

def mathfux(values, ids):
    idx = np.argsort(ids)
    return np.maximum.reduceat(values[idx], np.r_[0, np.flatnonzero(np.diff(ids[idx])) + 1])

def bb1(values, ids):
    return pd.DataFrame({'ids': ids, 'vals': values}).groupby('ids')['vals'].max().to_numpy()

def bb1_modified(values, ids):
    return pd.Series(values).groupby(ids).max().to_numpy()

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

%timeit Shannon(values, ids)
42.1 µs ± 561 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit richardec(values, ids)
27.7 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit MadPhysicist(values, ids)
19.3 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit PeptideWitch(values, ids)
55.9 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mathfux(values, ids)
20.9 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bb1(values, ids)
865 µs ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1_modified(values, ids)
537 µs ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)

%timeit Shannon(values, ids)
1.76 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit richardec(values, ids)
29.1 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit MadPhysicist(values, ids)
904 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit PeptideWitch(values, ids)
1.74 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit mathfux(values, ids)
372 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1(values, ids)
964 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1_modified(values, ids)
679 µs ± 7.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@mathfux's suggestion of using np.maximum.reduceat is by far the fastest all around, followed by the modified pandas solution and then lexsorting. Under the hood, the pandas solution is likely doing something very similar to reduceat, just with much more overhead, which is why trimming the pandas code down only yields a modest improvement.

For large arrays (which are the ones that really matter), the np.vectorize approach performs about the same as the original loop from the question, while richardec's comprehension is much slower, even though it skips sorting the input values by ID.

Upvotes: 5

bb1

Reputation: 7863

With pandas:

import pandas as pd

def with_pandas(ids, vals):
    df = pd.DataFrame({'ids': ids, 'vals': vals})
    return df.groupby('ids')['vals'].max().to_numpy()
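
On the sample arrays from the question this returns (a quick check, not part of the original answer):

with_pandas(ids, values)
array([5, 3, 6, 8, 8])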

Timing:

import numpy as np

values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)

%timeit with_pandas(ids, values)
692 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 1

PeptideWitch

Reputation: 2349

EDIT: I'll leave this up as an example of why we can't always use np.vectorize() to make everything magically faster:

One solution is to use numpy's vectorize function:

import numpy as np

values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])

def my_func(id):
    return np.max(values[np.where(ids==id)])

vector_func = np.vectorize(my_func)
maximums = vector_func(np.unique(ids))

which returns

array([5, 3, 6, 8, 8])

But as for speed, your version has about the same performance as this one when we use

import random

values = np.array([random.randint(1, 100) for i in range(1000000)])
ids = []
for i in range(100000):
    r = random.randint(1, 4)
    if r == 3:
        for x in range(3):
            ids.append(i)
    elif r == 2:
        for x in range(4):
            ids.append(i)
    else:
        ids.append(i)
ids = np.array(ids)

It's about 12 seconds per execution.

Upvotes: 1

hpaulj

Reputation: 231395

In trying to visualize the problem:

In [82]: [np.where(ids==id) for id in uniq_ids]
Out[82]: 
[(array([0]),),
 (array([1]),),
 (array([2, 3, 4, 5]),),
 (array([6]),),
 (array([7, 8, 9]),)]

unique can also return:

In [83]: np.unique(ids, return_inverse=True)
Out[83]: (array([0, 1, 3, 5, 6]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))

Which is a variant of what richardec produced:

In [88]: [a for a in np.split(ids, np.arange(1, ids.shape[0])[(np.diff(ids) !=
    ...: 0)])]
Out[88]: [array([0]), array([1]), array([3, 3, 3, 3]), array([5]), array([6, 6, 6])]

That inverse is also produced by doing nonzero/where on all of the == comparisons at once:

In [90]: ids[:,None] == uniq_ids
Out[90]: 
array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False],
       [False, False, False, False,  True],
       [False, False, False, False,  True],
       [False, False, False, False,  True]])
In [91]: np.nonzero(ids[:,None] == uniq_ids)
Out[91]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))

I'm still thinking through this ...
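
One possible way to finish that thought (my own addition, not from the original answer): the inverse index from np.unique can drive np.maximum.at, which accumulates a running maximum per group without explicitly sorting the values array:

uniq, inv = np.unique(ids, return_inverse=True)
maxima = np.full(uniq.shape, np.iinfo(values.dtype).min, dtype=values.dtype)  # assumes integer values
np.maximum.at(maxima, inv, values)
maxima
array([5, 3, 6, 8, 8])

ufunc.at is unbuffered and is generally slower than the sort-plus-reduceat approach on large inputs, but it matches the inverse-index view of the problem nicely.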

Upvotes: 1

user17242583

Reputation:

Here's a solution which, although not 100% vectorized, (per my benchmarks) takes about half the time yours does (using your sample data). The performance improvement probably becomes more drastic with more data:

maximums = [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]

Output:

>>> maximums
[5, 3, 6, 8, 8]

Upvotes: 2
