Abiel

Reputation: 5465

Reading numpy arrays outside of Python

In a recent question I asked about the fastest way to convert a large numpy array to a delimited string. I asked because I wanted to take that plain-text string and transmit it (over HTTP, for instance) to clients written in other programming languages. A delimited string of numbers is obviously something any client program can work with easily. However, it was suggested that because string conversion is slow, it would be faster on the Python side to base64-encode the array and send it as binary. This is indeed faster.

My questions now are: (1) how can I make sure my encoded numpy array will travel well to clients on different operating systems and different hardware, and (2) how do I decode the binary data on the client side?

For (1), my inclination is to do something like the following

import numpy as np
import base64
x = np.arange(100, dtype=np.float64)
base64.b64encode(x.tostring())

Is there anything else I need to do?

For (2), I would be happy to have an example in any programming language, where the goal is to take the numpy array of floats and turn it into a similar native data structure. Assume we have already done base64 decoding and have a byte array, and that we also know the numpy dtype, dimensions, and any other metadata that will be needed.
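For illustration, here is a minimal round-trip sketch in Python (the "client" here is also Python, just to make the byte-level steps concrete; pinning an explicit little-endian dtype `'<f8'` is an assumption, made so byte order is unambiguous across machines):

```python
import base64
import numpy as np

# Sender: fix the byte order explicitly so every client knows what to expect.
# '<f8' = little-endian 64-bit float (an assumption; pick one and document it).
x = np.arange(100, dtype='<f8')
payload = base64.b64encode(x.tobytes())

# Client: base64-decode, then reinterpret the raw bytes with the agreed dtype.
raw = base64.b64decode(payload)
y = np.frombuffer(raw, dtype='<f8')

assert np.array_equal(x, y)
```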

Thanks.

Upvotes: 6

Views: 3315

Answers (4)

lmjohns3

Reputation: 7602

For something a little lighter-weight than HDF (though admittedly also more ad-hoc), you could also use JSON:

import base64
import json
import numpy as np

x = np.arange(100, dtype=np.float64)

print(json.dumps(dict(data=base64.b64encode(x.tobytes()).decode('ascii'),
                      shape=x.shape,
                      dtype=str(x.dtype))))

This would free your clients from needing to install HDF wrappers, at the expense of having to deal with a nonstandard protocol for data exchange (and possibly also needing to install JSON bindings!). Note that the raw bytes need to be base64-encoded (or similar) before going into the JSON document, since arbitrary binary data is not a valid JSON string.

The tradeoff would be up to you to evaluate for your situation.
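A hypothetical client-side decode of such a payload might look like the following sketch (base64-encoding the bytes is an assumption here, since raw binary cannot be embedded in a JSON string directly):

```python
import base64
import json
import numpy as np

# Producer side: a payload carrying the bytes plus the metadata needed
# to rebuild the array (dtype and shape).
x = np.arange(12, dtype=np.float64).reshape(3, 4)
message = json.dumps(dict(data=base64.b64encode(x.tobytes()).decode('ascii'),
                          shape=x.shape,
                          dtype=str(x.dtype)))

# Client side: parse the JSON, decode the bytes, and rebuild the array
# from the transmitted dtype and shape.
doc = json.loads(message)
y = np.frombuffer(base64.b64decode(doc['data']),
                  dtype=doc['dtype']).reshape(doc['shape'])

assert np.array_equal(x, y)
```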

Upvotes: 2

Mike T

Reputation: 43702

You should really look into OPeNDAP to simplify all aspects of scientific data networking. For Python, check out Pydap.

You can directly store your NumPy arrays into HDF5 format via h5py (or NetCDF), then stream the data to clients over HTTP using OPeNDAP.

Upvotes: 3

ars

Reputation: 123558

I'd recommend using an existing data format for interchange of scientific data/arrays, such as NetCDF or HDF. In Python, you can use the PyNIO library which has numpy bindings, and there are several libraries for other languages. Both formats are built for handling large data and take care of language, machine representation problems, etc. They also work well with message passing, for example in parallel computing, so I suspect your use case is satisfied.

Upvotes: 1

Thomas Wouters

Reputation: 133505

What the tostring method of numpy arrays does is basically give you a dump of the memory used by the array's data (not the Python object wrapper, just the array's data). This is similar to the struct stdlib module. Base64-encoding that string and sending it across should be quite good enough, although you may also need to send the actual datatype used, as well as the dimensions if it's a multidimensional array, since you won't be able to tell those from the data alone.

On the other side, how to read the data depends a little on the language. Most languages have a way of addressing such a block of memory as a particular type of array. For example, in C, you could simply base64-decode the string, assign it to (in the case of your example) a float64 * and index away. This doesn't give you any of the built-in safeguards and functions and other operations that numpy arrays have in Python, but that's because C is quite a different language in that respect.
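To make that concrete without writing C, here is a sketch of the same idea using only the stdlib struct module, which is essentially what the C pointer cast does; it assumes the sender used little-endian float64 data:

```python
import base64
import struct
import numpy as np

# Producer side: five little-endian float64 values, base64-encoded.
x = np.arange(5, dtype='<f8')
payload = base64.b64encode(x.tobytes())

# Client side, without numpy: base64-decode, then unpack the block of
# memory as little-endian doubles ('<' = little-endian, 'd' = float64).
raw = base64.b64decode(payload)
values = struct.unpack('<%dd' % (len(raw) // 8), raw)

assert values == (0.0, 1.0, 2.0, 3.0, 4.0)
```

A C client would do the equivalent by casting the decoded buffer to a pointer of the appropriate floating-point type and indexing it, with the same caveat about agreeing on byte order in advance.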

Upvotes: 0
