jmd_dk

Reputation: 13090

Reinterpreting NumPy arrays as a different dtype

Say I have a large NumPy array of dtype int32

import numpy as np
N = 1000  # (large) number of elements
a = np.random.randint(0, 100, N, dtype=np.int32)

but now I want the data to be uint32. I could do

b = a.astype(np.uint32)

or even

b = a.astype(np.uint32, copy=False)

but in both cases b is a copy of a, whereas I simply want to reinterpret the data in a as uint32, so as not to duplicate the memory. Similarly, using np.asarray() does not help.

What does work is

a.dtype = np.uint32

which simply changes the dtype without altering the data at all. Here's a striking example:

import numpy as np
a = np.array([-1, 0, 1, 2], dtype=np.int32)
print(a)
a.dtype = np.uint32
print(a)  # shows "overflow", which is what I want

My questions are about the solution of simply overwriting the dtype of the array:

  1. Is this legitimate? Can you point me to where this feature is documented?
  2. Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
  3. What if I want two arrays a and b sharing the same data, but view it as different dtypes? I've found the following to work, but again I'm concerned if this is really OK to do:
    import numpy as np
    a = np.array([0, 1, 2, 3], dtype=np.int32)
    b = a.view(np.uint32)
    print(a)  # [0  1  2  3]
    print(b)  # [0  1  2  3]
    a[0] = -1
    print(a)  # [-1  1  2  3]
    print(b)  # [4294967295  1  2  3]
    
    Though this seems to work, I find it weird that the underlying data of the two arrays does not seem to be located at the same place in memory:
    print(a.data)
    print(b.data)
    
    Actually, it seems that the above gives different results each time it is run, so I don't understand what's going on there at all.
  4. This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats:
    import numpy as np
    a = np.array([0, 1, 2, np.pi], dtype=np.float32)
    b = a.view(np.float64)
    print(a)  # [0.  1.  2.  3.1415927]
    print(b)  # [0.0078125  50.12387848]
    b[0] = 8
    print(a)  # [0.  2.5  2.  3.1415927]
    print(b)  # [8.  50.12387848]
    
    Again, is this condoned, if the obtained behaviour is really what I'm after?

Upvotes: 9

Views: 1507

Answers (1)

Jérôme Richard

Reputation: 50278

  1. Is this legitimate? Can you point me to where this feature is documented?

This is legitimate. However, using the ndarray.view method (which is equivalent) is better since it is compatible with static analysers (so it is somewhat safer). Indeed, the documentation states:

It’s possible to mutate the dtype of an array at runtime. [...] This sort of mutation is not allowed by the types. Users who want to write statically typed code should instead use the numpy.ndarray.view method to create a view of the array with a different dtype.

  2. Does it in fact leave the data of the array untouched, i.e. no duplication of the data?

Yes, since the array is still a view on the same internal memory buffer (a basic byte array). Numpy will just reinterpret it differently (this is done directly in the C code of each Numpy computing function).
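A minimal sketch to check this yourself, using np.shares_memory and the buffer address exposed through __array_interface__ (the variable names are just for illustration). It also suggests why print(a.data) in the question looks different on each call: a.data builds a new memoryview object every time, and as far as I can tell the address in its repr is that of the memoryview object rather than of the array buffer.

```python
import numpy as np

a = np.array([-1, 0, 1, 2], dtype=np.int32)

copied = a.astype(np.uint32, copy=False)  # dtypes differ, so a new buffer is allocated
viewed = a.view(np.uint32)                # new array object over the same buffer

print(np.shares_memory(a, copied))  # False
print(np.shares_memory(a, viewed))  # True
print(viewed.base is a)             # True

# Address of the first element of each buffer (stable, unlike repr(a.data)).
addr_a = a.__array_interface__["data"][0]
addr_viewed = viewed.__array_interface__["data"][0]
print(addr_a == addr_viewed)        # True
```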

  3. What if I want two arrays a and b sharing the same data, but view it as different dtypes? [...]

The ndarray.view method can be used in this case, as you did in your example. However, the result is platform dependent. Indeed, Numpy just reinterprets bytes of memory, and theoretically the representation of negative numbers can change from one machine to another. Fortunately, nowadays, all mainstream modern processors use two's complement (source). This means that a np.int32 value like -1 will be reinterpreted as 2**32-1 = 4294967295 with a view of type np.uint32. Non-negative signed values are unchanged. As long as you are aware of this, this is fine and the behaviour is predictable.
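The two's complement rule can be checked against plain modular arithmetic. This is a small sketch assuming a two's complement machine; the name expected is only illustrative.

```python
import numpy as np

a = np.array([-1, -2, 5], dtype=np.int32)
b = a.view(np.uint32)  # same bytes, read as unsigned

# On a two's complement machine, the unsigned reading of x is x mod 2**32.
expected = a.astype(np.int64) % 2**32

print(b.tolist())                   # [4294967295, 4294967294, 5]
print(np.array_equal(b, expected))  # True
```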

  4. This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats.

Well, to put it briefly, this is really like playing with fire. In this case it is certainly unsafe, although it may work on your specific machine. Let us venture into troubled waters.

First of all, the documentation of ndarray.view states:

The behavior of the view cannot be predicted just from the superficial appearance of a. It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
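A small sketch of this layout dependence: two arrays holding the same values, one a strided slice and one a contiguous copy, behave differently when viewed with a dtype of a different size. The names full, strided and packed are illustrative only.

```python
import numpy as np

full = np.arange(8, dtype=np.int16)
strided = full[::2]                     # values 0, 2, 4, 6, stored 4 bytes apart
packed = np.ascontiguousarray(strided)  # same values, stored back to back

print(np.array_equal(strided, packed))  # True

# The contiguous copy can be reinterpreted as pairs of int16 read as int32.
wide = packed.view(np.int32)
print(wide.shape)                       # (2,)

# The strided slice cannot: its elements are not adjacent in memory.
try:
    strided.view(np.int32)
    strided_ok = True
except ValueError as exc:
    strided_ok = False
    print("view failed:", exc)
```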

The thing is that Numpy reinterprets the pointer in C code. Thus, AFAIK, the strict aliasing rule applies. This means that reinterpreting a np.float32 value as a np.float64 causes undefined behaviour. One reason is that the alignment requirements are not the same for np.float32 (typically 4 bytes) and np.float64 (typically 8 bytes), so reading an unaligned np.float64 value from memory can cause a crash on some architectures (e.g. POWER), although x86-64 processors support this. Another reason comes from the compiler, which can over-optimize the code due to the strict aliasing rule by making wrong assumptions in your case (like assuming that a np.float32 value and a np.float64 value cannot overlap in memory, so that modifying the view should not change the original array). However, since Numpy is called from CPython and no function calls are inlined from the interpreter (probably not with Cython either), this last point should not be a problem (it may be one if you use Numba or any JIT though).

Note that it is safe to get a np.uint8 view of a np.float32 array since this does not break the strict aliasing rule (and the alignment is OK). This can be useful to efficiently serialize Numpy arrays. The opposite operation is not safe (especially due to the alignment).
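As a rough illustration of the byte-view case, here is a sketch that serializes a float32 array through a np.uint8 view and reads it back. It also builds a float64 view at an odd byte offset and prints its flags.aligned attribute, which is how Numpy itself reports whether the data pointer meets the dtype's alignment. The names raw, payload, restored, buf and shifted are hypothetical.

```python
import numpy as np

a = np.array([0, 1, 2, np.pi], dtype=np.float32)

raw = a.view(np.uint8)    # 16 bytes, still backed by a's buffer
payload = raw.tobytes()   # independent, immutable copy of those bytes

# Rebuild the array from the bytes; the reader must know the dtype (and shape).
restored = np.frombuffer(payload, dtype=np.float32)
print(np.array_equal(a, restored))  # True

# The opposite direction: bytes starting at an odd offset, read as float64.
buf = np.zeros(17, dtype=np.uint8)
shifted = buf[1:].view(np.float64)
print(shifted.flags.aligned)  # whether the data pointer meets float64 alignment
```

Checking flags.aligned before handing such a view to other native code seems like a cheap precaution.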

Update about the last section: a deeper analysis of the Numpy code shows that some parts of the code, like the type-conversion functions, perform safe type punning using the memmove C call, while other functions, like all the basic unary and binary operators, do not appear to do proper type punning yet! Moreover, this feature is barely tested by users, and tricky corner cases are likely to cause weird bugs (especially if you read and write through two views of the same array). Thus, use it at your own risk.

Upvotes: 11
