Reputation: 1733
I am reading data from a UDP socket in a while loop. I need the most efficient way to:
1) Read the data (*) (that's kind of solved, but comments are appreciated)
2) Dump the (manipulated) data periodically to a file (**) (The Question)
I am anticipating a bottleneck in numpy's "tostring" method. Let's consider the following piece of (incomplete) code:
import socket
import numpy

nbuf = 4096
buf = numpy.zeros(nbuf, dtype=numpy.uint8)  # i.e., an array of bytes
f = open('dump.data', 'wb')
datasocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# ETC.. (code missing here) .. the datasocket is, of course, non-blocking
while True:
    gotsome = True
    try:
        N = datasocket.recv_into(buf)  # no memory allocation here .. (*)
    except socket.error:
        # no data available; do nothing ..
        gotsome = False
    if gotsome:
        # the bytes in "buf" will be manipulated in various ways ..
        # the following write is done frequently (not necessarily in each pass of the while loop):
        f.write(buf[:N].tostring())  # (**) The question: what is the most efficient way to do this?
f.close()
Now, at (**), as I understand it:
1) buf[:N] allocates memory for a new array object of length N, right? (maybe not)
.. and after that:
2) buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string
That seems like a lot of memory allocation and copying. In this same loop I will, in the future, read several sockets and write into several files.
Is there a way to just tell f.write to access the memory of "buf" directly, from byte 0 to byte N, and write it to disk?
I.e., to do this in the spirit of the buffer interface and avoid those two extra memory allocations?
P.S. f.write(buf[:N].tostring()) is equivalent to buf[:N].tofile(f).
Upvotes: 2
Views: 3654
Reputation: 284602
Basically, it sounds like you want to use the array's tofile method or to directly use the ndarray.data buffer object.
For your exact use case, using the array's data buffer is the most efficient, but there are a lot of caveats you need to be aware of for general use. I'll elaborate in a bit. First, though, let me answer a couple of your questions and provide a bit of clarification:
buf[:N] allocates memory for a new array object of length N, right?

It depends on what you mean by "new array object". Very little additional memory is allocated, regardless of the size of the arrays involved. It does allocate memory for a new array object (a few bytes of header), but it does not allocate additional memory for the array's data. Instead, it creates a "view" that shares the original array's data buffer. Any changes you make to y = buf[:N] will affect buf as well.
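As a quick illustration (a minimal sketch with a throwaway buffer, not your actual code):

import numpy as np

buf = np.zeros(8, dtype=np.uint8)
y = buf[:4]           # a view: no copy of the underlying data
y[0] = 255            # writing through the view ...
print(buf[0])         # ... changes buf as well -> 255
print(y.base is buf)  # True: y borrows buf's memory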
buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string
Yes, that's correct.
On a side note, you can actually go the opposite way (string to array) without allocating any additional memory:
import numpy as np
somestring = 'This could be a big string'
arr = np.frombuffer(buffer(somestring), dtype=np.uint8)  # buffer() is Python 2; in Python 3 pass a bytes object directly
However, because python strings are immutable, arr will be read-only.
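For instance, trying to write through it raises an error:

arr[0] = 0  # raises something like "ValueError: assignment destination is read-only"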
Is there a way to just tell f.write to access the memory of "buf" directly, from byte 0 to byte N, and write it to disk?

Yep! Basically, you'd want:

f.write(buf[:N].data)

This is very efficient and will work for any file-like object. It's almost definitely what you want in this exact case. However, there are several caveats!
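Applied to your loop, a minimal sketch (assuming buf and the non-blocking datasocket are set up as in your code):

with open('dump.data', 'wb') as f:
    while True:
        try:
            N = datasocket.recv_into(buf)
        except socket.error:
            continue  # nothing to read on this pass
        # ... manipulate buf[:N] here ...
        f.write(buf[:N].data)  # writes straight from buf's memory, no intermediate string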
First off, note that N is a count of items in the array, not of bytes. The two are equivalent in your example code (due to dtype=np.uint8, or any other 8-bit datatype).
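With a wider dtype the two counts diverge. A quick illustration:

import numpy as np
a = np.zeros(4, dtype=np.float64)
print(a[:2].nbytes)  # 16: the slice selects 2 items, each 8 bytes wide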
If you did want to write a number of bytes, you could do:

f.write(buf.data[:N])

...but slicing the arr.data buffer allocates a new string, so it's functionally similar to buf[:N].tostring(). At any rate, be aware that f.write(buf[:N].tostring()) is different from f.write(buf.data[:N]) for most dtypes, but both will allocate a new string.
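To make the difference concrete, here's a sketch with a hypothetical 16-bit buffer (buf16 isn't in your code, and the raw-buffer slicing shown is Python 2 behaviour):

import numpy as np
buf16 = np.arange(8, dtype=np.uint16)
print(buf16[:4].nbytes)     # 8: four items, 2 bytes each
print(len(buf16.data[:4]))  # 4: slicing the raw buffer counts bytes (Python 2)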
Next, numpy arrays can share data buffers. In your example case you don't need to worry about this, but in general, using somearr.data can lead to surprises for this reason.
As an example:
x = np.arange(10, dtype=np.uint8)
y = x[::2]
Now, y shares the same memory buffer as x, but it's not contiguous in memory (have a look at x.flags vs y.flags). Instead, it references every other item in x's memory buffer (compare x.strides to y.strides).
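Continuing with the x and y from above, you can see this directly:

print(x.flags.c_contiguous, y.flags.c_contiguous)  # True False
print(x.strides, y.strides)                        # (1,) (2,)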
If we try to access y.data, we'll get an error telling us that this is not a contiguous array in memory and that we can't get a single-segment buffer for it:
In [5]: y.data
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-364eeabf8187> in <module>()
----> 1 y.data
AttributeError: cannot get single-segment buffer for discontiguous array
This is a large part of the reason that numpy arrays have a tofile method (it also pre-dates python's buffers, but that's another story). tofile will write the data in the array to a file without allocating additional memory. However, because it's implemented at the C level, it only works for real file objects, not file-like objects (e.g. a socket, StringIO, etc).
For example:

buf[:N].tofile(f)
This does allow you to use arbitrary array indexing, however:

buf[someslice].tofile(f)

will make a new view (same memory buffer) and efficiently write it to disk. In your exact case, it will be slightly slower than slicing the arr.data buffer and writing it directly to disk.
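It also works for the discontiguous views that .data rejects. A small sketch, using the x from the example above (the output filename is just for illustration):

with open('evens.data', 'wb') as f:
    x[::2].tofile(f)  # fine: tofile handles the strided view itself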
If you'd prefer to use array indexing (rather than a number of bytes), then the ndarray.tofile method will be more efficient than f.write(arr.tostring()).
Upvotes: 8