Reputation: 1393
Suppose I have a 1d array a, and from each element I would like to build a range whose size is stored in the corresponding element of ranges:
a = np.array([10,9,12])
ranges = np.array([2,4,3])
The desired output would be:
np.array([10,11,9,10,11,12,12,13,14])
I could of course use a for loop, but I prefer a fully vectorized approach. np.repeat allows one to repeat the elements of a a given number of times by setting repeats=, but I am not aware of a similar numpy function that deals specifically with the problem above.
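For reference, the plain loop version I would like to avoid looks roughly like this (a minimal sketch):
import numpy as np

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])

# Build each range element by element and collect everything into one array.
out = []
for start, size in zip(a, ranges):
    for j in range(size):
        out.append(start + j)
out = np.array(out)
print(out)  # [10 11  9 10 11 12 12 13 14]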
Upvotes: 1
Views: 1379
Reputation: 26886
This problem is probably going to be solved much faster with a Numba-compiled function:
import numba as nb
import numpy as np

@nb.jit
def expand_range(values, counts):
    # Expand each value into the range value, value+1, ..., value+count-1.
    n = len(values)
    m = np.sum(counts)
    r = np.zeros((m,), dtype=values.dtype)
    k = 0
    for i in range(n):
        x = values[i]
        for j in range(counts[i]):
            r[k] = x + j
            k += 1
    return r
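Note that the first call triggers the JIT compilation, so for fair timings the function should be warmed up once on representative inputs beforehand (a minimal sketch, using the arrays from the question):
import numpy as np

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])

# Warm-up call: compiles expand_range for this argument signature,
# so compilation time is excluded from the %timeit runs below.
expand_range(a, ranges)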
On the very small inputs:
%timeit expand_range(a, ranges)
# 1.16 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and on somewhat larger inputs:
b = np.random.randint(0, 1000, 1000)
b_ranges = np.random.randint(1, 10, 1000)
%timeit expand_range(b, b_ranges)
# 5.07 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x = pd.Series(np.repeat(b, b_ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(b, b_ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(b)) for i in range(b[j],b[j]+b_ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These show the Numba-based approach winning, with a speed gain of at least 100x over any of the other approaches proposed so far.
With numbers closer to what has been indicated in one of the comments by the OP:
b = np.random.randint(10, 1000, 1000)
b_ranges = np.random.randint(100, 1000, 1000)
%timeit expand_range(b, b_ranges)
# 1.5 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit x = pd.Series(np.repeat(b, b_ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 91.8 ms ± 6.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(b, b_ranges)])
# 10.7 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array([i for j in range(len(b)) for i in range(b[j],b[j]+b_ranges[j])])
# 144 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which is still a respectable speedup of at least 7x over the others.
Upvotes: 1
Reputation: 12192
Someone asked about timing, so I compared the times of the three solutions (so far) in a very simple manner, using the %timeit
magic function in Jupyter notebook cells.
I set it up as follows:
import numpy as np
import pandas as pd

N = 1
a = np.array([10,9,12])
a = np.tile(a, N)
ranges = np.array([2,4,3])
ranges = np.tile(ranges, N)
a.shape, ranges.shape
This let me easily scale the problem size (albeit with repeated rather than random values).
Then I ran:
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Results were collected for N = 1, N = 10, N = 100, and N = 1000 (the timing output itself is not reproduced here).
So the Pandas solution wins when things get to arrays of 1000 elements or more, but the Python double list comprehension does an excellent job until that point. np.hstack
probably loses out because of extra memory allocation and copying, but that's a guess. Note also that the Pandas solution is nearly the same time for each array size.
Caveats still exist because there are repeated numbers, and all values are relatively small integers. This really shouldn't matter, but I'm not (yet) betting on it. (For example, the Pandas groupby functionality may be fast because of the repeated numbers; see the check below.)
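A quick correctness check with the tiled setup (a sketch, not part of the original timing runs) suggests the pandas approach can disagree with the hstack approach once a contains repeated values, because cumcount counts per value across all repeats:
import numpy as np
import pandas as pd

a = np.tile(np.array([10, 9, 12]), 2)        # the values in a now repeat
ranges = np.tile(np.array([2, 4, 3]), 2)

ref = np.hstack([np.arange(start, start + size) for start, size in zip(a, ranges)])
x = pd.Series(np.repeat(a, ranges))
alt = x.add(x.groupby(x).cumcount()).to_numpy()

# False: the second block of 10s keeps counting from where the first block stopped.
print(np.array_equal(ref, alt))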
Bonus: the OP stated in a comment that "The real life arrays are around 1000 elements, yet with ranges ranging from 100 to 1000. So becomes quite big – pr94". So I adjusted my timing test to the following:
import numpy as np
import pandas as pd
N = 1000
a = np.random.randint(100, 1000, N)
# This is how I understand "ranges ranging from 100 to 1000"
ranges = np.random.randint(100, 1000, N)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
The output of this run (not reproduced here) shows that those caveats I mentioned do, in some form at least, seem to exist. But people should double-check whether this testing code is actually the most relevant and appropriate, and whether it is correct.
Upvotes: 1
Reputation: 12192
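One option is to build each sub-range with np.arange and concatenate the pieces with np.hstack: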
>>> np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
array([10, 11, 9, 10, 11, 12, 12, 13, 14])
Upvotes: 2
Reputation: 71560
With pandas it could be easier:
>>> import pandas as pd
>>> x = pd.Series(np.repeat(a, ranges))
>>> x + x.groupby(x).cumcount()
0 10
1 11
2 9
3 10
4 11
5 12
6 12
7 13
8 14
dtype: int64
>>>
If you want a numpy array:
>>> x.add(x.groupby(x).cumcount()).to_numpy()
array([10, 11, 9, 10, 11, 12, 12, 13, 14], dtype=int64)
>>>
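For reference, the same repeat-plus-offset idea can also be sketched in plain numpy (not part of the original answer; a minimal sketch that avoids grouping by value):
import numpy as np

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])

# Repeat each start value, then add 0, 1, ..., size-1 within each block.
starts = np.repeat(a, ranges)
block_offsets = np.repeat(np.cumsum(ranges) - ranges, ranges)
out = starts + np.arange(ranges.sum()) - block_offsets
print(out)  # [10 11  9 10 11 12 12 13 14]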
Upvotes: 1