Numpy create two arrays using fromiter simultaneously

Question

I have an iterator that looks something like the following

it = ((x, x**2) for x in range(20))

What I want is two arrays: one of the xs and the other of the x**2s. I don't know the number of elements in advance, and I can't compute one entry from the other, so I can't build the first array and then derive the second from it.

If I had only one outcome with unknown size, I could use np.fromiter to have it dynamically allocate efficiently, e.g.

y = np.fromiter((x[0] for x in it), float)

With two outputs, I would hope I could do something like

ita, itb = itertools.tee(it)
y = np.fromiter((x[0] for x in ita), float)
y2 = np.fromiter((x[1] for x in itb), float)

but because the first call exhausts the iterator, I'd be better off doing

lst = list(it)
y = np.fromiter((x[0] for x in lst), float, len(lst))
y2 = np.fromiter((x[1] for x in lst), float, len(lst))

After all, tee will be filling a deque the size of the whole list anyway. I'd love to avoid copying the iterator into a list and then copying it again into an array, but I can't think of a way to incrementally build an array without doing it entirely manually. In addition, fromiter seems to be written in C, so a pure-Python version would probably end up no faster than making a list first.

unutbu · Accepted Answer

You could use np.fromiter to build one array with all the values, and then slice the array:

In [103]: it = ((x, x**2) for x in range(20))

In [104]: import itertools

In [105]: y = np.fromiter(itertools.chain.from_iterable(it), dtype=float)

In [106]: y
Out[106]: 
array([   0.,    0.,    1.,    1.,    2.,    4.,    3.,    9.,    4.,
         16.,    5.,   25.,    6.,   36.,    7.,   49.,    8.,   64.,
          9.,   81.,   10.,  100.,   11.,  121.,   12.,  144.,   13.,
        169.,   14.,  196.,   15.,  225.,   16.,  256.,   17.,  289.,
         18.,  324.,   19.,  361.])

In [107]: y, y2 = y[::2], y[1::2]

In [108]: y
Out[108]: 
array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.])

In [109]: y2
Out[109]: 
array([   0.,    1.,    4.,    9.,   16.,   25.,   36.,   49.,   64.,
         81.,  100.,  121.,  144.,  169.,  196.,  225.,  256.,  289.,
        324.,  361.])

The above manages to load the data from the iterator into arrays without the use of intermediate Python lists. The underlying data in the arrays is not contiguous, however. Many operations are faster on contiguous arrays:

In [19]: a = np.arange(10**6)

In [20]: y1 = a[::2]

In [21]: z1 = np.ascontiguousarray(y1)

In [24]: %timeit y1.sum()
1000 loops, best of 3: 975 µs per loop

In [25]: %timeit z1.sum()
1000 loops, best of 3: 464 µs per loop

So you may wish to make y and y2 contiguous:

y = np.ascontiguousarray(y)
y2 = np.ascontiguousarray(y2)
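To see what this changes, a minimal sketch can inspect the strides and flags before and after the call. The name `flat` is introduced here only for illustration, and the stride values assume 8-byte floats:

```python
import itertools
import numpy as np

it = ((x, x**2) for x in range(20))
flat = np.fromiter(itertools.chain.from_iterable(it), dtype=float)
y, y2 = flat[::2], flat[1::2]

# The slices are views into `flat` that step over every other element.
print(y.base is flat, y.strides, y.flags['C_CONTIGUOUS'])

# ascontiguousarray allocates a new, densely packed array.
yc = np.ascontiguousarray(y)
print(yc.strides, yc.flags['C_CONTIGUOUS'], np.shares_memory(yc, flat))
```

The first line should report a view with a 16-byte stride that is not C-contiguous; the second should report an 8-byte stride, a contiguous layout, and no memory shared with `flat`.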

Calling np.ascontiguousarray requires copying the non-contiguous data in y and y2 into new arrays. Unfortunately, I do not see a way to create y and y2 as contiguous arrays without copying.
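One possible workaround, offered as a sketch rather than a tested recommendation, is to fill two typed buffers from the standard library's array module in a single pass and then wrap them with np.frombuffer, which shares the buffer's memory instead of copying it. The function name `using_typed_buffers` is hypothetical, and the loop runs at Python speed, so it may well be slower than the slicing approach; it is not part of the timings in this answer.

```python
from array import array
import numpy as np

def using_typed_buffers(g):
    # Two growable buffers of C doubles, filled in one pass over g.
    xs, ys = array('d'), array('d')
    for a, b in g:
        xs.append(a)
        ys.append(b)
    # frombuffer wraps the existing memory; each result is contiguous.
    return np.frombuffer(xs, dtype=float), np.frombuffer(ys, dtype=float)

y, y2 = using_typed_buffers((x, x**2) for x in range(20))
print(y.flags['C_CONTIGUOUS'], y2.flags['C_CONTIGUOUS'])
```

This avoids both the list of tuples and the final copy, at the cost of appending element by element in Python.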

Here is a benchmark comparing the use of an intermediate Python list to NumPy slices (with and without ascontiguousarray):

import numpy as np
import itertools as IT

def using_intermediate_list(g):
    lst = list(g)
    y = np.fromiter((x[0] for x in lst), float, len(lst))
    y2 = np.fromiter((x[1] for x in lst), float, len(lst))
    return y, y2

def using_slices(g):
    y = np.fromiter(IT.chain.from_iterable(g), dtype=float)
    y, y2 = y[::2], y[1::2]
    return y, y2

def using_slices_contiguous(g):
    y = np.fromiter(IT.chain.from_iterable(g), dtype=float)
    y, y2 = y[::2], y[1::2]
    y = np.ascontiguousarray(y)
    y2 = np.ascontiguousarray(y2)
    return y, y2

def using_array(g):
    y = np.array(list(g))
    y, y2 = y[:, 0], y[:, 1]
    return y, y2

In [27]: %timeit using_intermediate_list(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 376 ms per loop

In [28]: %timeit using_slices(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 220 ms per loop

In [29]: %timeit using_slices_contiguous(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 237 ms per loop

In [34]: %timeit using_array(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 707 ms per loop
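Another variant that could be added to this comparison, sketched here without timings, passes the pairs straight to np.fromiter with a structured dtype, so no chain.from_iterable is needed. The function name `using_structured` and the field names 'a' and 'b' are placeholders chosen for this example. Like the slices, the two fields are strided views into one interleaved buffer, so they are not contiguous.

```python
import numpy as np

def using_structured(g):
    # Each (x, x**2) tuple becomes one record with two float fields.
    rec = np.fromiter(g, dtype=[('a', float), ('b', float)])
    # Field access returns views into rec, not copies.
    return rec['a'], rec['b']

y, y2 = using_structured((x, x**2) for x in range(20))
print(y.strides, y.flags['C_CONTIGUOUS'])
```

If contiguous results are needed, np.ascontiguousarray can be applied to each field just as in using_slices_contiguous.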
