LeeZamparo

Reputation: 437

How to construct an np.array with fromiter

I'm trying to construct an np.array by sampling from a Python generator that yields one row of the array per call to next. Here is some sample code:

import numpy as np
# Assumed source of shuffle and randint (the original snippet did not show the import)
from numpy.random import shuffle, randint

data = np.eye(9)
labels = np.array([0,0,0,1,1,1,2,2,2])

def extract_one_class(X,labels,y):
    """ Take an array of data X, a column vector array of labels, and one particular label y.  Return an array of all instances in X that have label y """

    return X[np.nonzero(labels[:] == y)[0],:]

def generate_points(data, labels, size):
    """ Generate and return 'size' pairs of points drawn from different classes """

    label_alphabet = np.unique(labels)
    assert label_alphabet.size > 1

    for _ in range(size):
        # Shuffle the labels in place, then take the first two as the pair's classes
        shuffle(label_alphabet)
        first_class = extract_one_class(data,labels,label_alphabet[0])
        second_class = extract_one_class(data,labels,label_alphabet[1])
        # Pick one random row from each class and concatenate them into one row
        pair = np.hstack((first_class[randint(0,first_class.shape[0]),:],second_class[randint(0,second_class.shape[0]),:]))
        yield pair

points = np.fromiter(generate_points(data,labels,5),dtype = np.dtype('f8',(2*data.shape[1],1)))

The extract_one_class function returns a subset of data: all data points belonging to one class label. I would like points to be an np.array of shape (size, 2*data.shape[1]), with one concatenated pair per row. Currently the code snippet above raises an error:

ValueError: setting an array element with a sequence.

The documentation of fromiter says it returns a one-dimensional array. Yet others have used fromiter to construct record arrays in numpy before (e.g. http://iam.al/post/21116450281/numpy-is-my-homeboy).

Am I off the mark in assuming I can generate an array in this fashion? Or is my numpy just not quite right?
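
For what it's worth, the error can be reproduced without the class-sampling logic. The sketch below uses a hypothetical `rows` generator as a stand-in for generate_points and passes a plain float64 dtype. As far as I can tell, the second positional argument of np.dtype is `align` rather than a shape, so the dtype in the snippet above ends up as plain 'f8' anyway.

```python
import numpy as np

def rows():
    # Hypothetical stand-in for generate_points: yields one 1-D row per step.
    for i in range(3):
        yield np.full(4, float(i))

try:
    np.fromiter(rows(), dtype=np.float64)
    error_message = None
except ValueError as exc:
    # Each yielded item is a whole row, but a float64 element holds one number.
    error_message = str(exc)

print(error_message)
```

So the failure comes from handing fromiter a sequence where it expects a scalar, not from a broken numpy install.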

Upvotes: 4

Views: 6671

Answers (3)

summentier

Reputation: 456

Following some suggestions here, I came up with a fairly general drop-in replacement for numpy.fromiter() that satisfies the requirements of the OP:

import numpy as np
def fromiter(iterator, dtype, *shape):
    """Generalises `numpy.fromiter()` to multi-dimensional arrays.

    Instead of the number of elements, the parameter `shape` has to be given,
    which contains the shape of the output array. The first dimension may be
    `-1`, in which case it is inferred from the iterator.
    """
    res_shape = shape[1:]
    if not res_shape:  # Fallback to the "normal" fromiter in the 1-D case
        return np.fromiter(iterator, dtype, shape[0])

    # This wrapping of the iterator is necessary because when used with the
    # field trick, np.fromiter does not enforce consistency of the shapes
    # returned with the '_' field and silently cuts additional elements.
    def shape_checker(iterator, res_shape):
        for value in iterator:
            if value.shape != res_shape:
                raise ValueError("shape of returned object %s does not match"
                                 " given shape %s" % (value.shape, res_shape))
            yield value,

    return np.fromiter(shape_checker(iterator, res_shape),
                       [("_", dtype, res_shape)], shape[0])["_"]
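
To see the field trick on its own, here is a minimal sketch without the shape checking. The `rows` generator and the sizes are made up for illustration; only np.fromiter and the single-field structured dtype come from the function above.

```python
import numpy as np

def rows(n_rows, n_cols):
    # Hypothetical generator: yields one 1-D float row per step.
    for i in range(n_rows):
        yield np.arange(n_cols, dtype=float) + 10 * i

n_rows, n_cols = 3, 4

# Wrap each row in a 1-tuple so it fills the single sub-array field "_".
records = np.fromiter(((row,) for row in rows(n_rows, n_cols)),
                      dtype=[("_", float, (n_cols,))], count=n_rows)

# Indexing the field gives back an ordinary 2-D float array.
points = records["_"]
print(points.shape)
```

The intermediate `records` array is still 1-D; the second dimension only appears once the "_" field is selected.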

Upvotes: 1

Pierre GM

Reputation: 20339

As you've noticed, the documentation of np.fromiter explains that the function creates a 1D array. You won't be able to create a 2D array that way, and @unutbu's method of returning a 1D array that you reshape afterwards is a safe bet.

However, you can indeed create structured arrays using fromiter, as illustrated by:

>>> import numpy as np
>>> a = zip((1,2,3),(10,20,30))
>>> r = np.fromiter(a,dtype=[('',int),('',int)])
>>> r
array([(1, 10), (2, 20), (3, 30)],
      dtype=[('f0', '<i8'), ('f1', '<i8')])

But note that r.shape is (3,): r is really just a 1D array of records, each record being composed of two integers. Because all the fields have the same dtype, we can take a view of r as a 2D array:

>>> r.view((int,2))
array([[ 1, 10],
       [ 2, 20],
       [ 3, 30]])

So, yes, you could try to use np.fromiter with a dtype like [('',int)]*data.shape[1]: you'll get a 1D array of length size, which you can then view as ((int, data.shape[1])). You can use floats instead of ints; the important part is that all fields have the same dtype.
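
Put together, that could look like the sketch below. The `rows` generator is a hypothetical stand-in for a generator that yields one row at a time, and each row is converted to a tuple because fromiter expects one tuple per record.

```python
import numpy as np

def rows(n_rows, n_cols):
    # Hypothetical generator: yields one 1-D float row per step.
    for i in range(n_rows):
        yield np.arange(n_cols, dtype=float) + 10 * i

n_rows, n_cols = 3, 4

# One unnamed float field per column; numpy names them f0, f1, ...
record_dtype = [("", float)] * n_cols
records = np.fromiter((tuple(row) for row in rows(n_rows, n_cols)),
                      dtype=record_dtype)

# All fields share one dtype, so the records can be viewed as a 2-D array.
points = records.view((float, n_cols))
print(records.shape, points.shape)
```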

If you really want it, you can use some fairly complex dtype. Consider for example

r = np.fromiter(((_,) for _ in zip((1,2,3),(10,20,30))),dtype=[('',(int,2))])

Here, you get a 1D structured array with 1 field, the field consisting of an array of 2 integers. Note the use of (_,) to make sure that each record is passed as a tuple (else np.fromiter chokes). But do you need that complexity?

Note also that since you know the length of the array beforehand (it is size), you should pass the optional count argument of np.fromiter for more efficiency.
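
A small sketch of that last point, with a made-up generator of floats: passing count lets fromiter allocate the output once instead of growing it as items arrive.

```python
import numpy as np

size = 5

# count tells fromiter how many items to read, so it can preallocate.
values = np.fromiter((i * 0.5 for i in range(size)), dtype=float, count=size)
print(values.shape)
```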

Upvotes: 9

unutbu

Reputation: 879919

You could modify generate_points to yield single floats instead of np.arrays, use np.fromiter to form a 1D array, and then use .reshape(size, -1) to make it a 2D array.

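size = 5
# dtype is required by np.fromiter; this assumes generate_points yields single floats
points = np.fromiter(
    generate_points(data,labels,size), dtype=float).reshape(size, -1)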
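
If you would rather leave the generator yielding whole rows, one option (my own variation, not part of the snippet above) is to flatten it with itertools.chain.from_iterable before handing it to np.fromiter. The `rows` generator below is a hypothetical stand-in for generate_points.

```python
import itertools
import numpy as np

def rows(n_rows, n_cols):
    # Hypothetical generator: yields one 1-D float row per step.
    for i in range(n_rows):
        yield np.arange(n_cols, dtype=float) + 10 * i

size, n_cols = 5, 4

# Chain the rows into one flat stream of scalars, then restore the 2-D shape.
flat = itertools.chain.from_iterable(rows(size, n_cols))
points = np.fromiter(flat, dtype=float, count=size * n_cols).reshape(size, -1)
print(points.shape)
```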

Upvotes: 5
