Erik

Reputation: 7342

Numpy create two arrays using fromiter simultaneously

I have an iterator that looks something like the following

it = ((x, x**2) for x in range(20))

and what I want is two arrays: one of the xs and the other of the x**2s. However, I don't know the number of elements in advance, and I can't compute one entry from the other, so I can't build the first array and then derive the second from it.

If I had only one output of unknown size, I could use np.fromiter, which grows its allocation dynamically and efficiently, e.g.

y = np.fromiter((x[0] for x in it), float)

With two outputs, I would hope I could do something like

ita, itb = itertools.tee(it)
y = np.fromiter((x[0] for x in ita), float)
y2 = np.fromiter((x[1] for x in itb), float)

but because the first np.fromiter call consumes all of ita before the second call starts, I'd be better off doing

lst = list(it)
y = np.fromiter((x[0] for x in lst), float, len(lst))
y2 = np.fromiter((x[1] for x in lst), float, len(lst))

because tee will end up buffering every element in its internal deque anyway. I'd love to avoid copying the iterator into a list and then copying it again into an array, but I can't think of a way to build an array incrementally without doing it entirely by hand. In addition, fromiter seems to be written in C, so a pure-Python version would probably offer no meaningful improvement over making a list first.

Upvotes: 2

Views: 337

Answers (1)

unutbu

Reputation: 879919

You could use np.fromiter to build one array with all the values, and then slice the array:

In [103]: it = ((x, x**2) for x in range(20))

In [104]: import itertools

In [105]: y = np.fromiter(itertools.chain.from_iterable(it), dtype=float)

In [106]: y
Out[106]: 
array([   0.,    0.,    1.,    1.,    2.,    4.,    3.,    9.,    4.,
         16.,    5.,   25.,    6.,   36.,    7.,   49.,    8.,   64.,
          9.,   81.,   10.,  100.,   11.,  121.,   12.,  144.,   13.,
        169.,   14.,  196.,   15.,  225.,   16.,  256.,   17.,  289.,
         18.,  324.,   19.,  361.])

In [107]: y, y2 = y[::2], y[1::2]

In [108]: y
Out[108]: 
array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.])

In [109]: y2
Out[109]: 
array([   0.,    1.,    4.,    9.,   16.,   25.,   36.,   49.,   64.,
         81.,  100.,  121.,  144.,  169.,  196.,  225.,  256.,  289.,
        324.,  361.])

The above manages to load the data from the iterator into arrays without the use of intermediate Python lists. The underlying data in the arrays is not contiguous, however. Many operations are faster on contiguous arrays:

In [19]: a = np.arange(10**6)

In [20]: y1 = a[::2]

In [21]: z1 = np.ascontiguousarray(y1)

In [24]: %timeit y1.sum()
1000 loops, best of 3: 975 µs per loop

In [25]: %timeit z1.sum()
1000 loops, best of 3: 464 µs per loop

So you may wish to make y and y2 contiguous:

y = np.ascontiguousarray(y)
y2 = np.ascontiguousarray(y2)
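
If you want to confirm which arrays are views and which are copies, a minimal sketch along these lines may help. The names flat, y_view and y_copy are only illustrative and do not appear in the session above:

```python
import itertools
import numpy as np

it = ((x, x**2) for x in range(20))
flat = np.fromiter(itertools.chain.from_iterable(it), dtype=float)

y_view = flat[::2]                       # strided view into flat, no copy
y_copy = np.ascontiguousarray(y_view)    # new array with its own buffer

print(y_view.flags['C_CONTIGUOUS'])      # False
print(y_view.base is flat)               # True
print(y_copy.flags['C_CONTIGUOUS'])      # True
print(np.shares_memory(y_copy, flat))    # False
```

The strided view keeps the whole interleaved buffer alive, so memory use only drops once both views and flat itself are no longer referenced.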

Calling np.ascontiguousarray requires copying the non-contiguous data in y and y2 into new arrays. Unfortunately, I do not see a way to create y and y2 as contiguous arrays without copying.
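
A possible variant, sketched here under the assumption that every item of the iterator is a pair, is to reshape the flat array to (n, 2), transpose it, and make a single contiguous copy. This still copies the data, but it does so in one call, and the two rows of the result are themselves contiguous. The name pairs is illustrative, and this variant is not included in the timings below:

```python
import itertools
import numpy as np

it = ((x, x**2) for x in range(20))
flat = np.fromiter(itertools.chain.from_iterable(it), dtype=float)

# View the interleaved data as (n, 2), transpose to (2, n),
# then copy once into C order.
pairs = np.ascontiguousarray(flat.reshape(-1, 2).T)

# Unpacking iterates over the rows; each row is a contiguous 1-D view of pairs.
y, y2 = pairs

print(y.flags['C_CONTIGUOUS'], y2.flags['C_CONTIGUOUS'])
```

Note that y and y2 share the single buffer owned by pairs, so neither can be freed independently of the other.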


Here is a benchmark comparing the use of an intermediate Python list to NumPy slices (with and without ascontiguousarray):

import numpy as np
import itertools as IT

def using_intermediate_list(g):
    lst = list(g)
    y = np.fromiter((x[0] for x in lst), float, len(lst))
    y2 = np.fromiter((x[1] for x in lst), float, len(lst))
    return y, y2

def using_slices(g):
    y = np.fromiter(IT.chain.from_iterable(g), dtype=float)
    y, y2 = y[::2], y[1::2]
    return y, y2

def using_slices_contiguous(g):
    y = np.fromiter(IT.chain.from_iterable(g), dtype=float)
    y, y2 = y[::2], y[1::2]
    y = np.ascontiguousarray(y)
    y2 = np.ascontiguousarray(y2)
    return y, y2

def using_array(g):
    y = np.array(list(g))
    y, y2 = y[:, 0], y[:, 1]
    return y, y2

In [27]: %timeit using_intermediate_list(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 376 ms per loop

In [28]: %timeit using_slices(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 220 ms per loop

In [29]: %timeit using_slices_contiguous(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 237 ms per loop

In [34]: %timeit using_array(((x, x**2) for x in range(10**6)))
1 loops, best of 3: 707 ms per loop

Upvotes: 2
