How can I initialize a NumPy array from a multidimensional buffer?

Question

The documentation for the numpy.frombuffer function specifically says that the generated array will be one dimensional:

Interpret a buffer as a 1-dimensional array.

I'm not sure about the consequences of this quote. The documentation just tells me that the generated array will be one dimensional, but never says that the input buffer has to describe a one-dimensional object.

I have a (2D) Eigen matrix in C++. I would like to create a Python buffer which describes the content of the matrix. Then, I would like to use this buffer to somehow initialize my NumPy array and make it available to my python scripts. The goal is both to pass information to Python without copying data and to allow python modify the matrix (e.g. to initialize the matrix).

The C-API equivalent of numpy.frombuffer is PyArray_FromBuffer, and it also shares the single-dimension phrase, but it has more documentation (emphasis mine):

PyObject* PyArray_FromBuffer(PyObject* buf, PyArray_Descr* dtype, npy_intp count, npy_intp offset)

Construct a one-dimensional ndarray of a single type from an object, buf, that exports the (single-segment) buffer protocol (or has an attribute __buffer__ that returns an object that exports the buffer protocol). A writeable buffer will be tried first followed by a read- only buffer. The NPY_ARRAY_WRITEABLE flag of the returned array will reflect which one was successful. The data is assumed to start at offset bytes from the start of the memory location for the object. The type of the data in the buffer will be interpreted depending on the data- type descriptor, dtype. If count is negative then it will be determined from the size of the buffer and the requested itemsize, otherwise, count represents how many elements should be converted from the buffer.

Does "single-segment" mean that it cannot contain padding used, e.g., to align the rows of the matrix? In that case I'm screwed, because my matrix could very well use an alignment strategy that requires padding.

Back to the original question:

Is there a way for me to create a NumPy array which shares the memory with an pre-existing buffer?

Remark: there is a project on github called Eigen3ToPython, which aims at connecting eigen with python, but it does not allow for memory sharing (emphasis mine):

This library allows to: [...] Convert to/from Numpy arrays (np.array) in a transparent manner (however, memory is not shared between both representations)

EDIT Somebody might point out the similarly-titled question Numpy 2D- Array from Buffer?. Unfortunately, the solution given there does not seem to be a valid one for my case, because the generated 2D array does not share the memory with the original buffer.

EDIT: how is data organized in Eigen

Eigen maps 2D matrices in a 1D memory buffer by using strided access. A double precision 3x2 matrix, for instance, needs 6 double, i.e., 48 bytes. A 48-bytes buffer is allocated. The first element in this buffer represents the [0, 0] entry in the matrix.

In order to access the element [i, j], the following formula is used:

double* v = matrix.data() + i*matrix.rowStride() + j*matrix.colStride()

, where matrix is the matrix object and its member functions data(), rowStride() and colStride() return, respectively, the start address of the buffer, the distance between two consecutive rows and the distance between two consecutive columns (in multiples of the floating point format size).

By default Eigen uses a column-major format, thus rowStride() == 1, but it can also be configured to use a row-major format, with colStride() == 1.

Another important configuration option is the alignment. The data buffer could very well include some unneeded values (i.e., values which are not part of the matrix) so to make the columns or rows start at aligned addresses. This makes the operations on the matrix vectorizable. In the example above, assuming column-major format and 16-byte alignment, the following matrix

3   7
1  -2
4   5

could be stored win the following buffer:

0  0  3  1  4  0  7 -2  5  0

The 0 values are called padding. The two 0's at the beginning could be necessary to ensure that the start of the actual data is aligned to the same boundary. (Notice that the data() member function will return the address of the 3.) In this case the strides for rows and columns are

rowStride: 1
colStride: 4

(while in the unaligned case they would be 1 and 3 respectively.)

Numpy expects a C-contiguous buffer, i.e., a row-major structure with no padding. If no padding is inserted by Eigen, then the problem of the row-major requirement can be worked around for a column-major Eigen matrix pretty easily: one passes the buffer to a numpy array, and the resulting ndarray is reshaped and transposed. I managed to make this work perfectly.

But in case Eigen does insert padding, the problem can not be solved using this technique because the ndarray will still see the zeroes in the data and think they are part of the matrix, at the same time discarding some values at the end of the array. And this is the problem I'm asking a solution for.

Now, as a side remark, since we have the luck of having @ggael in the loop, who can probably shed some light, I have to admit that I never had Eigen inserting any padding in my matrices. And I don't seem to find any mention of padding in the Eigen documentation. However, I would expect the alignment strategy to align every column (or row), and not just the first one. Am I wrong with my expectations? If I am, then the whole problem does not apply to Eigen. But it would apply to other libraries I'm using which apply the alignment strategy I described above, so please don't consider this last paragraph when answering the question.

Spiros · Accepted Answer

I'm answering my own question here. Thanks to @user2357112 for pointing in the right direction: what I need is PyArray_NewFromDescr.

The following Python object is a wrapper around an Eigen matrix:

struct PyEigenMatrix {
    PyObject_HEAD
    Eigen::Matrix matrix;
};

RealT is the floating-point type I'm using (float in my case).

In order to return an np.ndarray object, I add a member function to the class:

static PyObject*
PyEigenMatrix_as_ndarray(PyEigenMatrix* self, PyObject* args, PyObject* kwds)
{
    // Extract number of rows and columns from Eigen matrix
    npy_intp dims[] = { self->matrix.rows(), self->matrix.cols() };

    // Extract strides from Eigen Matrix (multiply by type size to get bytes)
    npy_intp strides[] = {
        self->matrix.rowStride() * (npy_intp)sizeof(RealT),
        self->matrix.colStride() * (npy_intp)sizeof(RealT)
    };

    // Create and return the ndarray
    return PyArray_NewFromDescr(
            &PyArray_Type,                  // Standard type
            PyArray_DescrFromType(typenum), // Numpy type id
            2,                              // Number of dimensions
            dims,                           // Dimension array
            strides,                        // Strides array
            self->matrix.data(),            // Pointer to data
            NPY_ARRAY_WRITEABLE,            // Flags
            (PyObject*)self                 // obj (?)
        );
}

typenum is the numpy type id number.

This call creates a new numpy array, gives it a buffer (through the data parameters), describes the buffer using the dims and strides parameters (the former also sets the shape of the returned array), describes the data dype, sets the matrix as read-write (through the flags parameter.

I'm not sure what the last parameter obj means though. The documentation mentions it only for cases where the type is different from PyArray_Type.

In order to illustrate how this works in practice, let me show some python code.

In [3]: m = Matrix(7, 3)

In [4]: m
Out[4]: 
  0.680375  -0.211234   0.566198
   0.59688   0.823295  -0.604897
 -0.329554   0.536459  -0.444451
   0.10794 -0.0452059   0.257742
 -0.270431  0.0268018   0.904459
   0.83239   0.271423   0.434594
 -0.716795   0.213938  -0.967399

In [5]: a = m.as_ndarray()

In [6]: a
Out[6]: 
array([[ 0.68 , -0.211,  0.566],
       [ 0.597,  0.823, -0.605],
       [-0.33 ,  0.536, -0.444],
       [ 0.108, -0.045,  0.258],
       [-0.27 ,  0.027,  0.904],
       [ 0.832,  0.271,  0.435],
       [-0.717,  0.214, -0.967]], dtype=float32)

In [7]: a[2, 1] += 4

In [8]: a
Out[8]: 
array([[ 0.68 , -0.211,  0.566],
       [ 0.597,  0.823, -0.605],
       [-0.33 ,  4.536, -0.444],
       [ 0.108, -0.045,  0.258],
       [-0.27 ,  0.027,  0.904],
       [ 0.832,  0.271,  0.435],
       [-0.717,  0.214, -0.967]], dtype=float32)

In [9]: m
Out[9]: 
  0.680375  -0.211234   0.566198
   0.59688   0.823295  -0.604897
 -0.329554    4.53646  -0.444451
   0.10794 -0.0452059   0.257742
 -0.270431  0.0268018   0.904459
   0.83239   0.271423   0.434594
 -0.716795   0.213938  -0.967399

Matrix is my PyEigenMatrix type. I added a __repr__ function which prints the matrix using Eigen's stream operators. I can had an ndarray a which corresponds exactly to the Eigen matrix. When I modify a (In[7]), not only does the numpy array get modified (Out[8]), but also the underlying Eigen array (Out[9]), showing that the two object share the same memory.

EDIT @user2357112 was right twice. The second method he proposes in the comments works as well. If the type PyEigenMatrix exports the buffer interface (which my type does), then the solution is as easy as creating a memoryview object, either in Python or using the C-API, and pass this object to the np.array function, also specifying copy=False.

Here is how it works:

In [2]: m = Matrix(7, 3)

In [3]: mv = memoryview(m)    

In [4]: a = np.array(mv, copy=False)

In [5]: m
Out[5]: 
  0.680375   0.536459   0.904459
 -0.211234  -0.444451    0.83239
  0.566198    0.10794   0.271423
   0.59688 -0.0452059   0.434594
  0.823295   0.257742  -0.716795
 -0.604897  -0.270431   0.213938
 -0.329554  0.0268018  -0.967399

In [6]: a
Out[6]: 
array([[ 0.68 ,  0.536,  0.904],
       [-0.211, -0.444,  0.832],
       [ 0.566,  0.108,  0.271],
       [ 0.597, -0.045,  0.435],
       [ 0.823,  0.258, -0.717],
       [-0.605, -0.27 ,  0.214],
       [-0.33 ,  0.027, -0.967]], dtype=float32)

In [7]: a [3, 1] += 2

In [8]: a
Out[8]: 
array([[ 0.68 ,  0.536,  0.904],
       [-0.211, -0.444,  0.832],
       [ 0.566,  0.108,  0.271],
       [ 0.597,  1.955,  0.435],
       [ 0.823,  0.258, -0.717],
       [-0.605, -0.27 ,  0.214],
       [-0.33 ,  0.027, -0.967]], dtype=float32)

In [9]: m
Out[9]: 
 0.680375  0.536459  0.904459
-0.211234 -0.444451   0.83239
 0.566198   0.10794  0.271423
  0.59688   1.95479  0.434594
 0.823295  0.257742 -0.716795
-0.604897 -0.270431  0.213938
-0.329554 0.0268018 -0.967399

This method has the advantage that it does not require the numpy C-API. The matrix type just has to support the buffer protocol, which is more general than a method directly relying on numpy.

How can I initialize a NumPy array from a multidimensional buffer?

Answers (1)

Related Questions