Reputation: 458
I have two arrays, one is a list of values and one is a list of IDs corresponding to each value. Some IDs have multiple values. I want to create a new array that contains the maximum value recorded for each id, which will have a length equal to the number of unique ids.
Example using a for loop:
import numpy as np
values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])
uniq_ids = np.unique(ids)
maximums = np.ones_like(uniq_ids) * np.nan
for i, id in enumerate(uniq_ids):
    maximums[i] = np.max(values[np.where(ids == id)])
print(uniq_ids)
print(maximums)
[0 1 3 5 6]
[5. 3. 6. 8. 8.]
Is it possible to vectorize this so it runs fast? I'm imagining a one-liner that can create the "maximums" array using only NumPy functions, but I haven't been able to come up with anything that works.
Upvotes: 5
Views: 507
Reputation: 5949
np.lexsort sorts by multiple keys, but that isn't required here. You can sort ids alone and then take the maximum of each resulting group using numpy.maximum.reduceat:
def mathfux(values, ids, return_groups=False):
    argidx = np.argsort(ids)  # 70% of the time
    ids_sort, values_sort = ids[argidx], values[argidx]  # 4% of the time
    div_points = np.r_[0, np.flatnonzero(np.diff(ids_sort)) + 1]  # 11% of the time (mostly np.flatnonzero)
    if return_groups:
        return ids_sort[div_points], np.maximum.reduceat(values_sort, div_points)
    else:
        return np.maximum.reduceat(values_sort, div_points)
>>> mathfux(values, ids, return_groups=True)
(array([0, 1, 3, 5, 6]), array([5, 3, 6, 8, 8]))
>>> mathfux(values, ids)
array([5, 3, 6, 8, 8])
Parts of NumPy code can often be optimised further with numba. Note, however, that np.argsort is the bottleneck in the majority of groupby problems and can't be replaced by any other method; it is unlikely to be improved soon in either numba or numpy. So you are already close to optimal performance here and can't do much in the way of further optimisation.
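For illustration, here is a minimal sketch of what moving the reduction step into numba might look like (the function name and loop are my own example, not part of the original answer; np.argsort would still dominate the runtime):
import numba
import numpy as np

@numba.njit
def group_max_sorted(values_sort, div_points):
    # div_points holds the start index of each group in the sorted arrays
    out = np.empty(len(div_points), dtype=values_sort.dtype)
    for g in range(len(div_points)):
        start = div_points[g]
        end = div_points[g + 1] if g + 1 < len(div_points) else len(values_sort)
        m = values_sort[start]
        for i in range(start + 1, end):  # scan the group for its maximum
            if values_sort[i] > m:
                m = values_sort[i]
        out[g] = m
    return out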
Upvotes: 4
Reputation: 114320
Use np.lexsort to sort both lists simultaneously:
idx = np.lexsort([values, ids])
The indices of the last occurrence of each unique ID are given by
last = np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]
You can use this to get the maxima of each group:
values[idx[last]]
This is the same as values[idx][last], but faster because you only extract len(last) elements this way, instead of rearranging the whole array and then extracting.
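A quick sanity check of that equivalence on hypothetical toy data (not from the original answer):
a = np.array([10, 20, 30, 40])
idx = np.array([3, 0, 2, 1])   # some permutation
last = np.array([1, 3])
# Indexing the index first touches only len(last) elements of a:
assert (a[idx[last]] == a[idx][last]).all()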
Keep in mind that np.unique is basically doing sort and flatnonzero under the hood when you pass return_index=True.
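Roughly, as a sketch of the idea (not np.unique's actual source), the first occurrence of each ID in the sorted array can be recovered with the same diff/flatnonzero pattern:
ids_sort = np.sort(ids)
first = np.r_[0, np.flatnonzero(np.diff(ids_sort)) + 1]
ids_sort[first]  # array([0, 1, 3, 5, 6]), same as np.unique(ids)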
Here's a collection of all the solutions so far, along with a benchmark. I also suggest a modified version of @bb1's pandas recipe, implemented below.
import numpy as np
from pandas import DataFrame, Series

def Shannon(values, ids):
    uniq_ids = np.unique(ids)
    maxima = np.ones_like(uniq_ids) * np.nan
    for i, id in enumerate(uniq_ids):
        maxima[i] = np.max(values[np.where(ids == id)])
    return maxima

def richardec(values, ids):
    return [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[np.diff(ids) != 0])]

def MadPhysicist(values, ids):
    idx = np.lexsort([values, ids])
    return values[idx[np.r_[np.flatnonzero(np.diff(ids[idx])), len(ids) - 1]]]

def PeptideWitch(values, ids):
    return np.vectorize(lambda id: np.max(values[np.where(ids == id)]))(np.unique(ids))

def mathfux(values, ids):
    idx = np.argsort(ids)
    return np.maximum.reduceat(values[idx], np.r_[0, np.flatnonzero(np.diff(ids[idx])) + 1])

def bb1(values, ids):
    return DataFrame({'ids': ids, 'vals': values}).groupby('ids')['vals'].max().to_numpy()

def bb1_modified(values, ids):
    return Series(values).groupby(ids).max().to_numpy()
values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])
%timeit Shannon(values, ids)
42.1 µs ± 561 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit richardec(values, ids)
27.7 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit MadPhysicist(values, ids)
19.3 µs ± 268 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit PeptideWitch(values, ids)
55.9 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit mathfux(values, ids)
20.9 µs ± 308 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bb1(values, ids)
865 µs ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1_modified(values, ids)
537 µs ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)
%timeit Shannon(values, ids)
1.76 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit richardec(values, ids)
29.1 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit MadPhysicist(values, ids)
904 µs ± 20 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit PeptideWitch(values, ids)
1.74 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit mathfux(values, ids)
372 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1(values, ids)
964 µs ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bb1_modified(values, ids)
679 µs ± 7.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@mathfux's suggestion of using np.maximum.reduceat is by far the fastest all around, followed by the modified pandas solution and then lexsorting. The pandas solution likely does something very similar to the reduceat one internally, but with a lot more overhead, which is why trimming the pandas recipe only buys a modest improvement.
For large arrays (which are the only ones that really matter), PeptideWitch's vectorized lookup performs about the same as the loop in the question, while richardec's comprehension is much slower, despite the fact that it does not sort the input values according to the IDs.
Upvotes: 5
Reputation: 7863
With pandas:
import pandas as pd
def with_pandas(ids, vals):
    df = pd.DataFrame({'ids': ids, 'vals': vals})
    return df.groupby('ids')['vals'].max().to_numpy()
Timing:
import numpy as np
values = np.random.randint(10000, size=10000)
ids = np.random.randint(100, size=10000)
%timeit with_pandas(ids, values)
692 µs ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 1
Reputation: 2349
EDIT: I'll leave this up as an example of why we can't always use np.vectorize() to make everything magically faster.
One solution is to use NumPy's vectorize function:
import numpy as np
values = np.array([5, 3, 2, 6, 3, 4, 8, 2, 4, 8])
ids = np.array([0, 1, 3, 3, 3, 3, 5, 6, 6, 6])
def my_func(id):
    return np.max(values[np.where(ids == id)])
vector_func = np.vectorize(my_func)
maximums = vector_func(np.unique(ids))
which returns
array([5, 3, 6, 8, 8])
But as for speed, it performs about the same as your version when we use larger test data:
import random

values = np.array([random.randint(1, 100) for i in range(1000000)])
ids = []
for i in range(100000):
    r = random.randint(1, 4)
    if r == 3:
        for x in range(3):
            ids.append(i)
    elif r == 2:
        for x in range(4):
            ids.append(i)
    else:
        ids.append(i)
ids = np.array(ids)
It's about 12 seconds per execution.
Upvotes: 1
Reputation: 231395
In trying to visualize the problem:
In [82]: [np.where(ids==id) for id in uniq_ids]
Out[82]:
[(array([0]),),
(array([1]),),
(array([2, 3, 4, 5]),),
(array([6]),),
(array([7, 8, 9]),)]
unique can also return an inverse index:
In [83]: np.unique(ids, return_inverse=True)
Out[83]: (array([0, 1, 3, 5, 6]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))
Which is a variant on what richardec produced:
In [88]: [a for a in np.split(ids, np.arange(1, ids.shape[0])[np.diff(ids) != 0])]
Out[88]: [array([0]), array([1]), array([3, 3, 3, 3]), array([5]), array([6, 6, 6])]
That inverse is also produced by doing where on all of the == comparisons at once:
In [90]: ids[:,None] == uniq_ids
Out[90]:
array([[ True, False, False, False, False],
[False, True, False, False, False],
[False, False, True, False, False],
[False, False, True, False, False],
[False, False, True, False, False],
[False, False, True, False, False],
[False, False, False, True, False],
[False, False, False, False, True],
[False, False, False, False, True],
[False, False, False, False, True]])
In [91]: np.nonzero(ids[:,None] == uniq_ids)
Out[91]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1, 2, 2, 2, 2, 3, 4, 4, 4]))
I'm still thinking through this ...
Upvotes: 1
Reputation:
Here's a solution which, although not 100% vectorized, (per my benchmarks) takes about half the time that yours does on your sample data; the improvement probably becomes more drastic with more data. Note that it assumes equal IDs are contiguous, as in your example:
maximums = [a.max() for a in np.split(values, np.arange(1, ids.shape[0])[(np.diff(ids) != 0)])]
Output:
>>> maximums
[5, 3, 6, 8, 8]
Upvotes: 2