Martin Thoma

Reputation: 136615

When should one use BytesIO .getvalue() instead of .getbuffer()?

According to the BytesIO docs:

getbuffer()

Return a readable and writable view over the contents of the buffer without copying them. Also, mutating the view will transparently update the contents of the buffer:

getvalue()

Return bytes containing the entire contents of the buffer.

So getbuffer seems to be the more complicated of the two. But what if you don't need a writable view? Would you then simply use getvalue? What are the trade-offs?

Minimal Example

In this example, swapping getbuffer() for getvalue() seems to give exactly the same result:

# Create an example
from io import BytesIO
bytesio_object = BytesIO(b"Hello World!")

# Write the stuff
with open("output.txt", "wb") as f:
    f.write(bytesio_object.getbuffer())
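
To make the comparison concrete, here is a minimal sketch that writes the same BytesIO once through getvalue() and once through getbuffer(), then compares the files and the returned types. The temporary directory and the file names are assumptions made for illustration only.

```python
from io import BytesIO
from pathlib import Path
import tempfile

bytesio_object = BytesIO(b"Hello World!")

# Hypothetical file names inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    value_path = Path(tmp_dir) / "from_getvalue.txt"
    buffer_path = Path(tmp_dir) / "from_getbuffer.txt"

    # getvalue() hands back a bytes object.
    value_path.write_bytes(bytesio_object.getvalue())

    # getbuffer() hands back a memoryview; release it when done.
    with bytesio_object.getbuffer() as view:
        buffer_path.write_bytes(view)

    same_content = value_path.read_bytes() == buffer_path.read_bytes()

value_type = type(bytesio_object.getvalue()).__name__
with bytesio_object.getbuffer() as view:
    view_type = type(view).__name__

print(same_content, value_type, view_type)
```

The bytes on disk are identical either way; what differs is the kind of object you hold in memory while writing.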

Upvotes: 21

Views: 6644

Answers (2)

Mr. B

Reputation: 2706

This question is old, but it looks like nobody has answered this sufficiently.

Simply:

  • obj.getbuffer() creates a memoryview object.
  • After a write, or whenever a memoryview of obj exists, obj.getvalue() has to build a new, complete bytes object.
  • If you have not written (since creation or since the last obj.getvalue() call) and there is no memoryview present, obj.getvalue() is the fastest method of access, and requires no copies.

That being the case:

  • When creating another io.BytesIO, use obj.getvalue()
  • For random-access reading and writing, DEFINITELY use obj.getbuffer()
  • Avoid interleaving reads and writes frequently. If you must, then DEFINITELY use obj.getbuffer(), unless your file is tiny.
  • Avoid calling obj.getvalue() while a buffer is lying around.
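
The advice above can be sketched as follows, assuming a small example payload chosen purely for illustration: the view is used for in-place random access and released before getvalue() is called.

```python
import io

bio = io.BytesIO(b"Hello World!")

# Using the memoryview as a context manager releases it on exit,
# so no view is left lying around afterwards.
with bio.getbuffer() as view:
    first_word = bytes(view[:5])  # copies only the 5 bytes we asked for
    view[6:11] = b"Earth"         # same-length slice assignment, in place

# The view has been released, so this call does not compete with it.
result = bio.getvalue()
print(first_word, result)
```

Slice assignment through a memoryview must keep the length unchanged; growing or shrinking the data still has to go through the BytesIO itself.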

Here we see that everything is fast as long as no buffer is lying around:

# time getvalue()
>>> import io
>>> i = io.BytesIO(b'f' * 1_000_000)
>>> %timeit i.getvalue()
34.6 ns ± 0.178 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# time getbuffer()
>>> %timeit i.getbuffer()
118 ns ± 0.495 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# time getbuffer() and getvalue() together
>>> %timeit i.getbuffer(); i.getvalue()
173 ns ± 0.829 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Everything is fine and works about as you'd expect. But let's see what happens when there's a buffer just lying around:

>>> x = i.getbuffer()
>>> %timeit i.getvalue()
33 µs ± 675 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Notice that we're no longer measuring in nanoseconds but in microseconds, which is roughly a thousand times slower. If you del x, we're back to being fast. This is because, while a memoryview exists, Python has to account for the possibility that the BytesIO contents were changed through it. So, to give the caller a definite state, getvalue() copies the buffer.
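
For readers without IPython, the same comparison can be approximated with the standard timeit module. This is only a sketch: the helper name time_getvalue and the loop count are assumptions, the absolute numbers depend on the machine, and the no-copy fast path is a CPython implementation detail.

```python
import io
import timeit


def time_getvalue(bio, number=1000):
    """Return the average number of seconds per getvalue() call."""
    return timeit.timeit(bio.getvalue, number=number) / number


bio = io.BytesIO(b"f" * 1_000_000)

# No memoryview exists yet.
without_view = time_getvalue(bio)

# Keep a memoryview alive while timing.
view = bio.getbuffer()
with_view = time_getvalue(bio)

# Releasing the view has the same effect as `del x` in the session above.
view.release()
after_release = time_getvalue(bio)

print(f"no view:       {without_view * 1e9:10.1f} ns")
print(f"view alive:    {with_view * 1e9:10.1f} ns")
print(f"view released: {after_release * 1e9:10.1f} ns")
```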

Upvotes: 5

Vad Sim

Reputation: 316

Using getbuffer() is better because, if you have really big data, copying it may take a long time. And (from PEP 20):

Explicit is better than implicit.

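But the type returned by getvalue() is not obvious from the name: it is str for StringIO and bytes for BytesIO. getbuffer() exists only on BytesIO and always returns a memoryview over bytes.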

Upvotes: 1
