Reputation: 7342
I have an iterator that looks something like the following:
it = ((x, x**2) for x in range(20))
What I want is two arrays: one of the x values and one of the x**2 values. But I don't know the number of elements in advance, and I can't convert one entry into the other, so I can't build the first array and then construct the second from it.
If I had only one output of unknown size, I could use np.fromiter to have it dynamically allocate efficiently, e.g.
y = np.fromiter((x[0] for x in it), float)
With two, I would hope I could do something like
ita, itb = itertools.tee(it)
y = np.fromiter((x[0] for x in ita), float)
y2 = np.fromiter((x[1] for x in itb), float)
but because the first call exhausts the iterator, I'd be better off doing
lst = list(it)
y = np.fromiter((x[0] for x in lst), float, len(lst))
y2 = np.fromiter((x[1] for x in lst), float, len(lst))
because tee will be filling a deque the size of the whole list anyway. I'd love to avoid copying the iterator into a list and then copying it again into an array, but I can't think of a way to build an array incrementally without doing it entirely by hand. In addition, fromiter seems to be written in C, so rewriting it in Python would probably end up no better than making a list first.
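The buffering claim about tee can be checked directly: when one branch runs ahead, itertools.tee stores every item the other branch has not yet seen, so draining one branch first forces the whole stream into memory anyway. A minimal sketch:

```python
import itertools

src = iter(range(5))
a, b = itertools.tee(src)

# Exhaust branch a first; tee must buffer everything b has not seen yet.
drained = list(a)

# Branch b still yields all five items -- they were held in tee's
# internal buffer, which at its peak was as large as the whole stream.
assert drained == [0, 1, 2, 3, 4]
assert list(b) == [0, 1, 2, 3, 4]
```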
Upvotes: 2
Views: 337
Reputation: 879919
You could use np.fromiter to build one array containing all the values, and then slice the array:
In [103]: it = ((x, x**2) for x in range(20))
In [104]: import itertools
In [105]: y = np.fromiter(itertools.chain.from_iterable(it), dtype=float)
In [106]: y
Out[106]:
array([ 0., 0., 1., 1., 2., 4., 3., 9., 4.,
16., 5., 25., 6., 36., 7., 49., 8., 64.,
9., 81., 10., 100., 11., 121., 12., 144., 13.,
169., 14., 196., 15., 225., 16., 256., 17., 289.,
18., 324., 19., 361.])
In [107]: y, y2 = y[::2], y[1::2]
In [108]: y
Out[108]:
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16., 17., 18., 19.])
In [109]: y2
Out[109]:
array([ 0., 1., 4., 9., 16., 25., 36., 49., 64.,
81., 100., 121., 144., 169., 196., 225., 256., 289.,
324., 361.])
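As an aside, one can verify that these slices are strided views into the single fromiter buffer rather than copies (a sketch using np.shares_memory):

```python
import itertools
import numpy as np

it = ((x, x**2) for x in range(20))
flat = np.fromiter(itertools.chain.from_iterable(it), dtype=float)
y, y2 = flat[::2], flat[1::2]

# Both slices alias the original buffer: basic slicing makes views, not copies.
assert np.shares_memory(y, flat) and np.shares_memory(y2, flat)

# The views step over every other element, so neither is C-contiguous.
assert not y.flags['C_CONTIGUOUS'] and not y2.flags['C_CONTIGUOUS']
```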
The above manages to load the data from the iterator into arrays without the use of intermediate Python lists. The underlying data in the arrays is not contiguous, however. Many operations are faster on contiguous arrays:
In [19]: a = np.arange(10**6)
In [20]: y1 = a[::2]
In [21]: z1 = np.ascontiguousarray(y1)
In [24]: %timeit y1.sum()
1000 loops, best of 3: 975 µs per loop
In [25]: %timeit z1.sum()
1000 loops, best of 3: 464 µs per loop
So you may wish to make y and y2 contiguous:
y = np.ascontiguousarray(y)
y2 = np.ascontiguousarray(y2)
Calling np.ascontiguousarray requires copying the non-contiguous data in y and y2 into new arrays. Unfortunately, I do not see a way to create y and y2 as contiguous arrays without copying.
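The copy can be observed directly: after np.ascontiguousarray, the result no longer aliases the original buffer, while an already-contiguous input is passed through without copying. A small sketch:

```python
import numpy as np

a = np.arange(10, dtype=float)
y = a[::2]                    # non-contiguous strided view into a

z = np.ascontiguousarray(y)   # copies the strided data into a fresh buffer
assert z.flags['C_CONTIGUOUS']
assert not np.shares_memory(z, a)   # z no longer aliases a

# When the input is already contiguous, no copy is made.
assert np.shares_memory(np.ascontiguousarray(z), z)
```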
Here is a benchmark comparing the use of an intermediate Python list to NumPy slices (with and without ascontiguousarray):
import numpy as np
import itertools as IT

def using_intermediate_list(g):
    lst = list(g)
    y = np.fromiter((x[0] for x in lst), float, len(lst))
    y2 = np.fromiter((x[1] for x in lst), float, len(lst))
    return y, y2

def using_slices(g):
    y = np.fromiter(IT.chain.from_iterable(g), dtype=float)
    y, y2 = y[::2], y[1::2]
    return y, y2

def using_slices_contiguous(g):
    y = np.fromiter(IT.chain.from_iterable(g), dtype=float)
    y, y2 = y[::2], y[1::2]
    y = np.ascontiguousarray(y)
    y2 = np.ascontiguousarray(y2)
    return y, y2

def using_array(g):
    y = np.array(list(g))
    y, y2 = y[:, 0], y[:, 1]
    return y, y2
In [27]: %timeit using_intermediate_list(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 376 ms per loop
In [28]: %timeit using_slices(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 220 ms per loop
In [29]: %timeit using_slices_contiguous(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 237 ms per loop
In [34]: %timeit using_array(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 707 ms per loop
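One more possibility, not benchmarked above: np.fromiter also accepts a structured dtype, so each (x, x**2) tuple can be read directly into one two-field record and the fields extracted afterwards. Note the extracted fields are still non-contiguous views into the record array, so the same ascontiguousarray caveat applies. A sketch:

```python
import numpy as np

def using_structured_dtype(g):
    # Each tuple from the iterator fills one record with two float fields.
    rec = np.fromiter(g, dtype=[('x', float), ('x2', float)])
    return rec['x'], rec['x2']

y, y2 = using_structured_dtype((x, x**2) for x in range(20))
assert y[5] == 5.0 and y2[5] == 25.0
```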
Upvotes: 2