Reputation: 23068
The following code constructs a NumPy array with a dtype object:
import numpy as np

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32)
])
arr = np.array([
    [0, 20, 3],
    [1, 21, 2],
    [2, 23, 8],
    [3, 26, 5],
    [4, 31, 9]
]).astype(dt)
The expected result of arr would be:
>>> arr
array([[ 0, 20, 3.],
       [ 1, 21, 2.],
       [ 2, 23, 8.],
       [ 3, 26, 5.],
       [ 4, 31, 9.]])
>>> arr[0]
array([ 0, 20, 3.])
But what the code above is creating is actually this:
>>> arr
array([[( 0,  0,  0.), (20, 20, 20.), ( 3,  3,  3.)],
       [( 1,  1,  1.), (21, 21, 21.), ( 2,  2,  2.)],
       [( 2,  2,  2.), (23, 23, 23.), ( 8,  8,  8.)],
       [( 3,  3,  3.), (26, 26, 26.), ( 5,  5,  5.)],
       [( 4,  4,  4.), (31, 31, 31.), ( 9,  9,  9.)]],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
>>> arr[0]
array([( 0,  0,  0.), (20, 20, 20.), ( 3,  3,  3.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
Why is NumPy creating a version of every value for every data type instead of mapping each column to its own data type (and only this one)? I'm guessing that I did something wrong there. Is there a way to get to the result I was expecting?
Upvotes: 2
Views: 1107
Reputation: 88236
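The issue here is that creating a structured array requires a list of tuples, one tuple per record, with one value per field in each tuple. Calling astype(dt) on a regular 2D array instead casts every scalar element to the structured dtype on its own, filling all three fields with the same value, which is why each number shows up three times. The tuple convention is the same one used in Structured Datatype Creation, where the dtype itself is given as a list of tuples, one tuple per field.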
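To make the difference concrete, here is a minimal sketch contrasting the two behaviours. The names plain, broadcast and records are made up for illustration, and only two rows of the question's data are used:

```python
import numpy as np

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])

# Illustrative name: a plain (unstructured) 2-D integer array.
plain = np.array([[0, 20, 3],
                  [1, 21, 2]])

# astype casts each scalar element separately, so the 2-D shape is kept
# and every element becomes a full (index, timestamp, volume) record.
broadcast = plain.astype(dt)
print(broadcast.shape)   # (2, 3)

# One tuple per row gives a 1-D array with one record per row.
records = np.array([tuple(row) for row in plain.tolist()], dtype=dt)
print(records.shape)     # (2,)
print(records["volume"])
```

The list comprehension is only there to show the required input shape; for large arrays the vectorised approaches further down should be preferable.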
So what you can do is turn your array into a list of tuples (zip will be convenient here) and build the structured array from it using np.fromiter, specifying dt as the dtype. Here arr is the plain 2D array, before the astype call:
np.fromiter(zip(*arr.T), dtype=dt)
array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.), (4, 31, 9.)],
dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
Another, lesser-known approach, mentioned by @hpaulj in the comments, is np.lib.recfunctions.unstructured_to_structured, which constructs the structured array directly from arr and the dtype object:
import numpy.lib.recfunctions  # not always loaded by a bare "import numpy"
np.lib.recfunctions.unstructured_to_structured(arr, dt)
array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.), (4, 31, 9.)],
      dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
Or, based on this other post, you can create a record array instead. This is an ndarray subclass that is used much like a structured array and comes with several helper functions, such as np.core.records.fromarrays (also available under the public alias np.rec.fromarrays; the np.core namespace is deprecated as of NumPy 2.0), which builds the array from a sequence of columns:
np.core.records.fromarrays(arr.T,
                           names='index, timestamp, volume',
                           formats='<i4, <i4, <f4')
rec.array([(0, 20, 3.), (1, 21, 2.), (2, 23, 8.), (3, 26, 5.),
(4, 31, 9.)],
dtype=[('index', '<i4'), ('timestamp', '<i4'), ('volume', '<f4')])
Or, to create it from the np.dtype object:
names, dtypes = list(zip(*dt.descr))
np.core.records.fromarrays(arr.transpose(),
                           names=', '.join(names),
                           formats=', '.join(dtypes))
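As a small illustration of how a record array is used once it exists, the sketch below builds one through the np.rec.fromarrays alias and reads a field both as an attribute and by key. It passes the dtype directly through the dtype keyword rather than names and formats; the variable names and the three-row sample are assumptions for the example:

```python
import numpy as np

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])

# Illustrative plain 2-D input; each column becomes one field.
plain = np.array([[0, 20, 3],
                  [1, 21, 2],
                  [2, 23, 8]])

# fromarrays expects one array per field, hence the transpose.
rec = np.rec.fromarrays(plain.T, dtype=dt)

# Record arrays expose fields as attributes in addition to key access.
volumes_by_attr = rec.volume
volumes_by_key = rec["volume"]
first_timestamp = rec[0].timestamp

print(type(rec).__name__)
print(volumes_by_attr, volumes_by_key, first_timestamp)
```

Attribute access is the main practical difference from a plain structured array, which only supports the rec["volume"] form.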
Timings comparing the mentioned methods, and some other possible approaches:
a = np.concatenate([arr]*1000, axis=0)
%%timeit
np.core.records.fromarrays(a.T,
names='index, timestamp, volume',
formats = '<i4, <i4, <f4')
# 57.9 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.lib.recfunctions.unstructured_to_structured(a, dt)
# 79.6 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.fromiter(zip(*a.T), dtype=dt)
# 2.1 ms ± 69.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.fromiter(map(tuple, a), dtype=dt)
# 6.34 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array(list(zip(*a.T)), dtype=dt)
# 2.17 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
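Since the timings only matter if the approaches produce the same data, a quick sanity check along these lines can be run outside IPython. This is a sketch that assumes the input is the plain 2D array (called plain here, a name chosen for the example) and compares results through tolist() to sidestep any differences between ndarray and recarray comparison:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([
    ("index", np.int32),
    ("timestamp", np.int32),
    ("volume", np.float32),
])

plain = np.array([[0, 20, 3],
                  [1, 21, 2],
                  [2, 23, 8],
                  [3, 26, 5],
                  [4, 31, 9]])

# Same enlargement as in the timings: 5000 rows, 3 columns.
a = np.concatenate([plain] * 1000, axis=0)

r1 = np.rec.fromarrays(a.T, dtype=dt)            # record array
r2 = rfn.unstructured_to_structured(a, dt)       # structured array
r3 = np.fromiter(zip(*a.T), dtype=dt)            # structured array
r4 = np.array(list(zip(*a.T)), dtype=dt)         # structured array

# Compare as plain Python lists of tuples.
all_equal = r1.tolist() == r2.tolist() == r3.tolist() == r4.tolist()
print(len(r1), all_equal)
```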
Upvotes: 2