Say I have a large NumPy array of dtype `int32`:

```python
import numpy as np
N = 1000  # (large) number of elements
a = np.random.randint(0, 100, N, dtype=np.int32)
```

but now I want the data to be `uint32`. I could do

```python
b = a.astype(np.uint32)
```

or even

```python
b = a.astype(np.uint32, copy=False)
```

but in both cases `b` is a copy of `a`, whereas I want to simply reinterpret the data in `a` as being `uint32`, so as not to duplicate the memory. Similarly, using `np.asarray()` does not help.
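A quick way to confirm the copying (using `np.shares_memory`, which tests whether two arrays may use the same buffer):

```python
import numpy as np

a = np.random.randint(0, 100, 1000, dtype=np.int32)

b1 = a.astype(np.uint32)
b2 = a.astype(np.uint32, copy=False)   # copy=False still copies when the dtype differs
b3 = np.asarray(a, dtype=np.uint32)

print(np.shares_memory(a, b1))  # False
print(np.shares_memory(a, b2))  # False
print(np.shares_memory(a, b3))  # False
```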
What does work is

```python
a.dtype = np.uint32
```

which simply changes the `dtype` without altering the data at all. Here's a striking example:

```python
import numpy as np
a = np.array([-1, 0, 1, 2], dtype=np.int32)
print(a)
a.dtype = np.uint32
print(a)  # shows "overflow", which is what I want
```
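As a sanity check (just one way to verify this; `__array_interface__['data'][0]` exposes the address of the underlying buffer), overwriting the dtype does not reallocate anything:

```python
import numpy as np

a = np.array([-1, 0, 1, 2], dtype=np.int32)
addr_before = a.__array_interface__['data'][0]  # address of the underlying buffer
a.dtype = np.uint32                             # reinterpret in place
addr_after = a.__array_interface__['data'][0]
print(addr_before == addr_after)  # True: same buffer, no copy
```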
My questions are about the solution of simply overwriting the `dtype` of the array:

- Is this legitimate? Can you point me to where this feature is documented?
- Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
- What if I want two arrays `a` and `b` sharing the same data, but view it as different `dtype`s? I've found the following to work, but again I'm concerned if this is really OK to do:

```python
import numpy as np
a = np.array([0, 1, 2, 3], dtype=np.int32)
b = a.view(np.uint32)
print(a)  # [0 1 2 3]
print(b)  # [0 1 2 3]
a[0] = -1
print(a)  # [-1 1 2 3]
print(b)  # [4294967295 1 2 3]
```
Though this seems to work, I find it weird that the underlying data of the two arrays does not seem to be located at the same place in memory:

```python
print(a.data)
print(b.data)
```

Actually, it seems that the above gives different results each time it is run, so I don't understand what's going on there at all.

- This can be extended to other `dtype`s, the most extreme of which is probably mixing 32 and 64 bit floats:
```python
import numpy as np
a = np.array([0, 1, 2, np.pi], dtype=np.float32)
b = a.view(np.float64)
print(a)  # [0. 1. 2. 3.1415927]
print(b)  # [0.0078125 50.12387848]
b[0] = 8
print(a)  # [0. 2.5 2. 3.1415927]
print(b)  # [8. 50.12387848]
```
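For what it's worth, the first printed value of `b` can be reproduced by hand with the `struct` module (a byte-level sketch, assuming a little-endian machine): packing the float32 pair `(0.0, 1.0)` and unpacking those 8 bytes as one float64 gives exactly that number:

```python
import struct

# two float32 values, little-endian, reinterpreted as one float64
raw = struct.pack('<2f', 0.0, 1.0)
print(struct.unpack('<d', raw)[0])  # 0.0078125, i.e. 2**-7
```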
Again, is this condoned, if the obtained behaviour is really what I'm after?
- Is this legitimate? Can you point me to where this feature is documented?

This is legitimate. However, using `ndarray.view` (which is equivalent) is better since it is compatible with static analysers (so it is somewhat safer). Indeed, the documentation states:

> It's possible to mutate the dtype of an array at runtime. [...] This sort of mutation is not allowed by the types. Users who want to write statically typed code should instead use the `numpy.ndarray.view` method to create a view of the array with a different dtype.
- Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
Yes, since the array is still a view of the same internal memory buffer (a flat byte array). NumPy will just reinterpret it differently (this is done directly in the C code of each NumPy computing function).
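This can be checked directly (a small sketch; `ndarray.base` and `np.shares_memory` are the standard introspection tools here):

```python
import numpy as np

a = np.arange(4, dtype=np.int32)
b = a.view(np.uint32)
print(b.base is a)             # True: b is a view whose base is a
print(np.shares_memory(a, b))  # True: no new buffer was allocated
```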
- What if I want two arrays `a` and `b` sharing the same data, but view it as different `dtype`s? [...]

`ndarray.view` can be used in this case, as you did in your example. However, the result is platform dependent. Indeed, NumPy just reinterprets the bytes in memory, and theoretically the representation of negative numbers can change from one machine to another. Fortunately, nowadays, all mainstream modern processors use two's complement (source). This means that an `np.int32` value like `-1` will be reinterpreted as `2**32-1 = 4294967295` through a view of type `np.uint32`. Positive signed values are unchanged. As long as you are aware of this, this is fine and the behaviour is predictable.
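A quick sketch of that two's-complement mapping (a negative value `x` maps to `x + 2**32`, positive values are unchanged):

```python
import numpy as np

a = np.array([-1, -2**31, 5], dtype=np.int32)
b = a.view(np.uint32)
print([int(x) for x in b])      # [4294967295, 2147483648, 5]
print(int(b[0]) == -1 + 2**32)  # True
```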
- This can be extended to other `dtype`s, the most extreme of which is probably mixing 32 and 64 bit floats.

To put it shortly, this is really like playing with fire. In this case it is certainly unsafe, although it may work on your specific machine. Let us venture into troubled waters.
First of all, the documentation of `ndarray.view` states:

> The behavior of the view cannot be predicted just from the superficial appearance of `a`. It also depends on exactly how `a` is stored in memory. Therefore if `a` is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
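One concrete consequence of this layout dependence: when the itemsize changes, `view` requires the last axis to be contiguous, so the same call that works on a C-ordered array is rejected on its transpose (a sketch; the exact error message varies between NumPy versions):

```python
import numpy as np

a = np.zeros((2, 4), dtype=np.float32)
print(a.view(np.float64).shape)  # (2, 2): last axis is contiguous, reinterpretation is allowed

try:
    a.T.view(np.float64)  # transpose: the last axis is no longer contiguous
except ValueError:
    print("view rejected on the transposed array")
```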
The thing is, NumPy reinterprets the pointer in C code, so, AFAIK, the strict aliasing rule applies. This means that reinterpreting an `np.float32` value as an `np.float64` causes undefined behaviour. One reason is that the alignment requirements are not the same for `np.float32` (typically 4 bytes) and `np.float64` (typically 8 bytes), so reading an unaligned `np.float64` value from memory can cause a crash on some architectures (e.g. POWER), although x86-64 processors support this. Another reason comes from the compiler, which can over-optimize the code due to the strict aliasing rule by making wrong assumptions in your case (e.g. that an `np.float32` value and an `np.float64` value cannot overlap in memory, so a modification through the view should not change the original array). However, since NumPy is called from CPython and no function calls are inlined by the interpreter (probably not with Cython either), this last point should not be a problem (it may be an issue if you use Numba or any JIT, though). Note that it is safe to get an `np.uint8` view of an `np.float32` array, since it does not break the strict aliasing rule (and the alignment is OK). This can be useful to efficiently serialize NumPy arrays. The opposite operation is not safe (especially due to the alignment).
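To sketch the safe direction mentioned above (a byte-level view for serialization; `tobytes` is used here only to check the result against a serialized copy):

```python
import numpy as np

a = np.array([0.0, 1.0, np.pi], dtype=np.float32)
raw = a.view(np.uint8)            # byte-level view of the float buffer, no copy
print(raw.shape)                  # (12,): 3 elements * 4 bytes each
print(bytes(raw) == a.tobytes())  # True: same bytes as a serialized copy
```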
Update about the last section: a deeper analysis of the NumPy code shows that some parts of the code, like the type-conversion functions, perform safe type punning using the `memmove` C call, while some other functions, like all the basic unary or binary operators, do not appear to do proper type punning yet! Moreover, such a feature is barely tested by users, and tricky corner cases are likely to cause weird bugs (especially if you read and write through two views of the same array). Thus, use it at your own risk.