El Sampsa

Reputation: 1733

Copying bytes in Python from Numpy array into string or bytearray

I am reading data from a UDP socket in a while loop. I need the most efficient way to

1) Read the data (*) (that's kind of solved, but comments are appreciated)

2) Dump the (manipulated) data periodically in a file (**) (The Question)

I am anticipating a bottleneck in numpy's "tostring" method. Consider the following (incomplete) piece of code:

import socket
import numpy

nbuf=4096
buf=numpy.zeros(nbuf,dtype=numpy.uint8) # i.e., an array of bytes
f=open('dump.data','wb') # binary mode, since we are dumping raw bytes

datasocket=socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# ETC.. (code missing here) .. the datasocket is, of course, non-blocking

while True:
  gotsome=True
  try:
    N=datasocket.recv_into(buf) # no memory-allocation here .. (*)
  except socket.error:
    # do nothing ..
    gotsome=False

  if gotsome:
    # the bytes in "buf" will be manipulated in various ways ..
    # the following write is done frequently (not necessarily in each pass of the while loop):
    f.write(buf[:N].tostring())  # (**) The question: what is the most efficient way to do this?

f.close() 

Now, at (**), as I understand it:

1) buf[:N] allocates memory for a new array object, having the length N+1, right? (maybe not)

.. and after that:

2) buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string

That seems a lot of memory-allocation & swapping. In this same loop, in the future, I will read several sockets and write into several files.

Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?

I.e., to do this in the spirit of the buffer interface and avoid those two extra memory allocations?

P.S. f.write(buf[:N].tostring()) is equivalent to buf[:N].tofile(f)

Upvotes: 2

Views: 3654

Answers (1)

Joe Kington

Reputation: 284602

Basically, it sounds like you want to use the array's tofile method or directly use the ndarray.data buffer object.

For your exact use-case, using the array's data buffer is the most efficient, but there are a lot of caveats that you need to be aware of for general use. I'll elaborate in a bit.


However, first let me answer a couple of your questions and provide a bit of clarification:

buf[:N] allocates memory for a new array object, having the length N+1, right?

It depends on what you mean by "new array object". Very little additional memory is allocated, regardless of the size of the arrays involved.

It does allocate memory for a new array object (a few bytes), but it does not allocate additional memory for the array's data. Instead, it creates a "view" that shares the original array's data buffer. Any changes you make to y = buf[:N] will affect buf as well.
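A quick way to see the view behaviour for yourself (a minimal sketch; `np.shares_memory` is available in modern NumPy):

```python
import numpy as np

buf = np.zeros(8, dtype=np.uint8)
view = buf[:4]   # a view: no copy of the underlying data is made

view[0] = 255    # writing through the view...
print(buf[0])    # ...changes buf as well -> 255

# Both objects refer to the same underlying data buffer:
print(np.shares_memory(buf, view))  # True
```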

buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string

Yes, that's correct.

On a side note, you can actually go the opposite way (string to array) without allocating any additional memory:

somestring = 'This could be a big string'
arr = np.frombuffer(buffer(somestring), dtype=np.uint8)

However, because python strings are immutable, arr will be read-only.
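The snippet above uses Python 2's buffer type; under Python 3, where raw data lives in bytes objects, the same zero-copy trick looks roughly like this (a sketch, not part of the original answer):

```python
import numpy as np

somestring = b'This could be a big string'          # bytes in Python 3
arr = np.frombuffer(somestring, dtype=np.uint8)     # zero-copy view of the bytes

print(arr[:4])               # byte values of 'This'
print(arr.flags.writeable)   # False: the underlying bytes are immutable
```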


Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?

Yep!

Basically, you'd want:

f.write(buf[:N].data)

This is very efficient and will work for any file-like object. It's almost definitely what you want in this exact case. However, there are several caveats!

First off, note that N counts items in the array, not bytes directly. They're equivalent in your example code (due to dtype=np.uint8, or any other 8-bit datatype).
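To make the 8-bit case concrete, here is a minimal sketch, with io.BytesIO standing in for the open file (note that under Python 3, arr.data is a memoryview rather than a Py2 buffer, but f.write accepts it the same way):

```python
import io

import numpy as np

buf = np.zeros(4096, dtype=np.uint8)
buf[:5] = [1, 2, 3, 4, 5]
N = 5   # pretend recv_into returned 5

f = io.BytesIO()        # any binary file-like object works here
f.write(buf[:N].data)   # writes straight from the array's buffer, no copy

print(f.getvalue())     # b'\x01\x02\x03\x04\x05'
```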

If you did want to write a number of bytes, you could do

f.write(buf.data[:N])

...but slicing the arr.data buffer allocates a new string, so it's functionally similar to buf[:N].tostring(). At any rate, be aware that f.write(buf[:N].tostring()) is different from f.write(buf.data[:N]) for most dtypes, but both allocate a new string.

Next, numpy arrays can share data buffers. In your example case, you don't need to worry about this, but in general, using somearr.data can lead to surprises for this reason.

As an example:

x = np.arange(10, dtype=np.uint8)
y = x[::2]

Now, y shares the same memory buffer as x, but it's not contiguous in memory (have a look at x.flags vs y.flags). Instead it references every other item in x's memory buffer (compare x.strides to y.strides).
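You can see the difference directly in the flags and strides (a small sketch of the comparison described above):

```python
import numpy as np

x = np.arange(10, dtype=np.uint8)
y = x[::2]   # every other element: a strided view, same buffer

print(x.flags['C_CONTIGUOUS'])   # True
print(y.flags['C_CONTIGUOUS'])   # False

print(x.strides)   # (1,) - step 1 byte between items
print(y.strides)   # (2,) - step 2 bytes: skips every other byte of x
```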

If we try to access y.data, we'll get an error telling us that this is not a contiguous array in memory, and we can't get a single-segment buffer for it:

In [5]: y.data
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-54-364eeabf8187> in <module>()
----> 1 y.data

AttributeError: cannot get single-segment buffer for discontiguous array

This is a large part of the reason that numpy arrays have a tofile method (it also pre-dates python's buffers, but that's another story).

tofile will write the data in the array to a file without allocating additional memory. However, because it's implemented at the C-level it only works for real file objects, not file-like objects (e.g. a socket, StringIO, etc).

For example:

buf[:N].tofile(f)


This does allow you to use arbitrary array indexing, however.

buf[someslice].tofile(f)

will make a new view (same memory buffer) and efficiently write it to disk. In your exact case, it will be slightly slower than slicing the arr.data buffer and writing it directly to disk. If you'd prefer to use array indexing (rather than a number of bytes), then the ndarray.tofile method will be more efficient than f.write(arr.tostring()).
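Putting the tofile route together, a minimal sketch using a temporary file as the "real" file object (tobytes is the modern name for tostring, used here only for the verification step):

```python
import os
import tempfile

import numpy as np

buf = np.arange(16, dtype=np.uint8)

# tofile needs a real file object (one backed by an OS file descriptor):
with tempfile.NamedTemporaryFile(delete=False) as f:
    buf[:8].tofile(f)   # writes the first 8 bytes, no intermediate string
    path = f.name

with open(path, 'rb') as f:
    data = f.read()

print(data == buf[:8].tobytes())   # True
os.remove(path)
```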

Upvotes: 8
