lash

Reputation: 756

Python numpy data pointer addresses change without modification

EDIT

After some more fiddling around, I've so far isolated the following states:

  1. A 1D array gives two different addresses when the variable is entered directly, and only one when using print()
  2. A 2D array (or matrix) gives three different addresses when the variable is entered directly, and two when using print()
  3. A 3D array gives two different addresses when the variable is entered directly, and only one when using print() (apparently the same as with the 1D array)

Like so:

>>> a = numpy.array([1,2,3], dtype="int32")

>>> a.data
<memory at 0x7f02e85e4048>
>>> a.data
<memory at 0x7f02e85e4110>
>>> a.data
<memory at 0x7f02e85e4048>
>>> a.data
<memory at 0x7f02e85e4110>
>>> a.data
<memory at 0x7f02e85e4048>

>>> print(a.data)
<memory at 0x7f02e85e4110>
>>> print(a.data)
<memory at 0x7f02e85e4110>
>>> print(a.data)
<memory at 0x7f02e85e4110>
>>> print(a.data)
<memory at 0x7f02e85e4110>
>>> print(a.data)
<memory at 0x7f02e85e4110>


>>> d = numpy.array([[1,2,3]], dtype="int32")

>>> d.data
<memory at 0x7f02e863ae48>
>>> d.data
<memory at 0x7f02e863a9e8>
>>> d.data
<memory at 0x7f02e863aac8>
>>> d.data
<memory at 0x7f02e863ae48>
>>> d.data
<memory at 0x7f02e863a9e8>
>>> d.data
<memory at 0x7f02e863aac8>

>>> print(d.data)
<memory at 0x7f02e863ae48>
>>> print(d.data)
<memory at 0x7f02e863a9e8>
>>> print(d.data)
<memory at 0x7f02e863ae48>
>>> print(d.data)
<memory at 0x7f02e863a9e8>
>>> print(d.data)
<memory at 0x7f02e863ae48>


>>> b = numpy.matrix([[1,2,3],[4,5,6]], dtype="int32")

>>> b.data
<memory at 0x7f02e863a9e8>
>>> b.data
<memory at 0x7f02e863ae48>
>>> b.data
<memory at 0x7f02e863aac8>
>>> b.data
<memory at 0x7f02e863a9e8>
>>> b.data
<memory at 0x7f02e863ae48>

>>> print(b.data)
<memory at 0x7f02e863aac8>
>>> print(b.data)
<memory at 0x7f02e863a9e8>
>>> print(b.data)
<memory at 0x7f02e863aac8>
>>> print(b.data)
<memory at 0x7f02e863a9e8>
>>> print(b.data)
<memory at 0x7f02e863aac8>


>>> c = numpy.matrix([[1,2,3],[4,5,6],[7,8,9]], dtype="int32")

>>> c.data
<memory at 0x7f02e863aac8>
>>> c.data
<memory at 0x7f02e863a9e8>
>>> c.data
<memory at 0x7f02e863ae48>
>>> c.data
<memory at 0x7f02e863aac8>
>>> c.data
<memory at 0x7f02e863ae48>
>>> c.data
<memory at 0x7f02e863a9e8>
>>> c.data
<memory at 0x7f02e863aac8>

>>> print(c.data)
<memory at 0x7f02e863ae48>
>>> print(c.data)
<memory at 0x7f02e863a9e8>
>>> print(c.data)
<memory at 0x7f02e863ae48>
>>> print(c.data)
<memory at 0x7f02e863a9e8>
>>> print(c.data)
<memory at 0x7f02e863ae48>


>>> e = numpy.array([[[0,1],[2,3]],[[4,5],[6,7]]], dtype="int32")

>>> e.data
<memory at 0x7f8ca0fe1048>
>>> e.data
<memory at 0x7f8ca0fe1140>
>>> e.data
<memory at 0x7f8ca0fe1048>
>>> e.data
<memory at 0x7f8ca0fe1140>
>>> e.data
<memory at 0x7f8ca0fe1048>


>>> print(e.data)
<memory at 0x7f8ca0fe1048>
>>> print(e.data)
<memory at 0x7f8ca0fe1048>
>>> print(e.data)
<memory at 0x7f8ca0fe1048>

ORIGINAL POST

I was under the impression that merely entering a variable alone in the Python console would echo a string simply describing its value (and type). It formats the output differently than print() does, but I assumed the values they both showed would be the same.

When I try to output the address of the data pointer object of a numpy object, just entering the variable gives me a different value every other time, while print() gives the same value.

That suggests that the difference between the two operations isn't just how the output is formatted, but also where they get their information from. But what exactly do these additional differences consist of?

>>> a = numpy.array([0,1,2])

>>> a
array([0, 1, 2])
>>> print(a)
[0 1 2]

>>> print(a.data)
<memory at 0x7ff25120c110>
>>> print(a.data)
<memory at 0x7ff25120c110>
>>> print(a.data)
<memory at 0x7ff25120c110>

>>> a.data
<memory at 0x7ff25120c110>
>>> a.data
<memory at 0x7ff253099818>
>>> a.data
<memory at 0x7ff25120c110>
>>> a.data
<memory at 0x7ff253099818>
>>> a.data
<memory at 0x7ff25120c110>

Upvotes: 6

Views: 1920

Answers (3)

benjimin

Reputation: 4890

In Python, object.attribute does not necessarily just look up and retrieve a pre-stored variable. It may instead execute a customisable function (for example a property getter, or object.__getattr__("attribute") when the normal lookup fails), which can return anything at all, could have arbitrary side-effects, and may even return different values if it is invoked multiple times.
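As a minimal sketch of the point above, the class below exposes something that reads like a stored attribute but is recomputed on every access. The names Holder and view are hypothetical and exist only for this illustration; this is not how numpy implements ndarray.data internally, only the same observable pattern.

```python
class Holder:
    """Hypothetical class: `view` looks like a stored attribute but is computed."""

    def __init__(self, payload: bytes) -> None:
        self._payload = bytearray(payload)

    @property
    def view(self) -> memoryview:
        # A fresh memoryview wrapper is built on every attribute access.
        return memoryview(self._payload)


holder = Holder(b"abc")
first = holder.view
second = holder.view

# Two accesses produce two distinct wrapper objects...
same_object = first is second
# ...that nevertheless describe the same underlying bytes.
same_contents = first.tobytes() == second.tobytes()

print(same_object, same_contents)
```

Both wrappers are kept alive here by the names first and second, so they cannot occupy the same address.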

Don't confuse the raw memory allocation (where the values of a numpy.ndarray are stored) with the address of a memoryview object (that only stores metadata relating to a memory allocation).

The actual memory address of the array's data is given by ndarray.ctypes.data. This generates an integer with the same value each time it is requested. (In CPython it actually generates a different int object each time, but this doesn't matter because all of them have the same value.)

>>> import numpy
>>> array = numpy.ones((10,10))
>>> address = array.ctypes.data
>>> address2 = array.ctypes.data
>>> address is address2
False
>>> address == address2
True
>>> address, address2, id(address), id(address2)
(94626990418400, 94626990418400, 140364094130064, 140364094130224)

Similarly, invoking ndarray.data generates a memoryview object, and each time you do this it will be a different, new memoryview object (although they will all store identical metadata since they are each describing the same array).

Unfortunately, when you try to print a memoryview instance to the console (i.e. when you ask the memoryview to generate a string representation of itself), it returns a string describing the location where this memoryview object itself is stored, not the location of the memory that it refers to. (It presents this number in base 16, whereas Python normally presents integers in base 10.)

>>> x = array.data
>>> repr(x), id(x), hex(id(x))
('<memory at 0x7fa90fff3a68>', 140364094585448, '0x7fa90fff3a68')

If you simply type array.data into a console repeatedly, you will likely see different hex values each time, because you are regenerating new memoryview objects (that all describe the same array).

You also may sometimes see cyclical repeats of the same hex values (assuming you do not assign a unique name for each memoryview object, either by not assigning a name at all or by reassigning the name to a subsequent object). This is because once an object is no longer needed (i.e. there are no longer any name-handles for your code to refer to that same particular instance of the object again) it is discarded, and the space it formerly occupied becomes freed for new objects to use instead. So if you repeatedly execute array.data you may sometimes find that a new memoryview object gets constructed at the exact same address where an earlier one had been.
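The reuse described above can be probed with a short script. This is a sketch that assumes numpy is installed; the variable names are made up for the illustration. Only the first and last counts are guaranteed. How often a freed address is reused is an implementation detail of the interpreter's allocator.

```python
import numpy

array = numpy.arange(3, dtype="int32")

# Keep every view alive: live objects can never share an id().
kept = [array.data for _ in range(5)]
kept_ids = {id(view) for view in kept}

# Drop each view straight away: the interpreter is free (but not obliged)
# to build the next view at an address that was just released.
dropped_ids = {id(array.data) for _ in range(5)}

# The address of the array's actual data does not change between requests.
buffer_addresses = {array.ctypes.data for _ in range(5)}

print(len(kept_ids))          # always 5
print(len(dropped_ids))       # anywhere from 1 to 5
print(len(buffer_addresses))  # always 1
```

Holding references (as in kept) is what forces distinct addresses; without them, repeated hex values in the console say nothing about the array itself.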

Upvotes: 0

Jir

Reputation: 3145

From the docs

ndarray.data

Python buffer object pointing to the start of the array’s data.

Which should just be a memoryview of the data.

Edit, trying to be clearer:

In my case, a 1-D array gives new values every time - it doesn't cycle between two values only:

In [196]: a = numpy.array([0, 1, 2])

In [197]: a.data
Out[197]: <read-write buffer for 0x7f7de5934f80, size 24, offset 0 at 0x7f7de594d4b0>

In [198]: a.data
Out[198]: <read-write buffer for 0x7f7de5934f80, size 24, offset 0 at 0x7f7de594df70>

In [199]: a.data
Out[199]: <read-write buffer for 0x7f7de5934f80, size 24, offset 0 at 0x7f7de594d570>

In [200]: a.data
Out[200]: <read-write buffer for 0x7f7de5934f80, size 24, offset 0 at 0x7f7de594d870>

I think the behaviour is not peculiar to numpy only. See what happens with a buffer:

In [222]: a = ('123' * 999)

In [223]: buffer(a)
Out[223]: <read-only buffer for 0x7f7de003cbd0, size -1, offset 0 at 0x7f7de5955170>

In [224]: buffer(a)
Out[224]: <read-only buffer for 0x7f7de003cbd0, size -1, offset 0 at 0x7f7de594ddb0>

In [225]: buffer(a)
Out[225]: <read-only buffer for 0x7f7de003cbd0, size -1, offset 0 at 0x7f7de597a5b0>

In [226]: buffer(a)
Out[226]: <read-only buffer for 0x7f7de003cbd0, size -1, offset 0 at 0x7f7de594de70>

In the case of buffer, the docs say (emphasis mine):

buffer(object[, offset[, size]])

The object argument must be an object that supports the buffer call interface (such as strings, arrays, and buffers). A new buffer object will be created which references the object argument.

So I guess we should expect the memory address to change. However, back to the original question: it seems that some caching happens, and I agree with you both that it must be down to some sort of optimisation. Unfortunately, I could not find out from the Python code base why, or in which cases, the caching happens.

Upvotes: 1

The memoryview returned by a.data seems to alternate between two (or more) views. If you store a given instance of a.data, you get consistent output:

>>> a.data
<memory at 0x7fb88ea1f828>
>>> a.data
<memory at 0x7fb88e98c4a8>
>>> t = a.data
>>> a.data
<memory at 0x7fb88e98ce48>
>>> a.data
<memory at 0x7fb88e98c3c8>
>>> a.data
<memory at 0x7fb88e98c4a8>
>>> a.data
<memory at 0x7fb88e98ce48>
>>> a.data
<memory at 0x7fb88e98c3c8>
>>> a.data
<memory at 0x7fb88e98c4a8>
>>> t
<memory at 0x7fb88ea1f828>
>>> t
<memory at 0x7fb88ea1f828>
>>> t
<memory at 0x7fb88ea1f828>

Note that there are 3 addresses rotating in the above example; I'm pretty sure this is all an implementation detail. I would guess that some caching is involved, implying that a new view is not actually generated each time you access a.data.

You can also make certain that you are looking at separate view objects:

>>> id(a.data)
140430643088968
>>> id(a.data)
140430643086280
>>> id(a.data)
140430643088968
>>> id(a.data)
140430643086280

So most of the confusion probably comes from the fact that the attribute notation of a.data would suggest that it's a fixed object we're talking about, while this is not the case.
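One documented difference between echoing a value and calling print() may account for part of the pattern in the question, although none of the answers here confirm it for the asker's session, so treat it as an assumption. At the interactive prompt an echoed value goes through sys.displayhook, which prints repr(value) and also saves the value in builtins._, keeping the previous result alive until the next echo. print() keeps no such reference. The sketch below uses a plain bytearray instead of a numpy array so that it runs with the standard library alone.

```python
import builtins
import sys

buffer = bytearray(b"abc")

# Echoing at the >>> prompt is routed through sys.displayhook, which prints
# repr(value) and stores the value in builtins._ .
echoed = memoryview(buffer)
sys.__displayhook__(echoed)
kept_alive_by_underscore = builtins._ is echoed

# print() only writes str(value); it returns None and keeps no reference,
# so the temporary view can be freed as soon as the call finishes.
print_result = print(memoryview(buffer))

print(kept_alive_by_underscore, print_result)
```

If the previous view is still referenced by _ when the next one is created, the new view cannot reuse its address, which would be consistent with echoed values alternating while print() repeats a single address.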

Upvotes: 2
