Reputation: 357
I am working on parsing Matlab structured arrays in Python. For simplicity, the data structure ultimately consists of 3 fields, say header, body, trailer. Creating some data in Matlab for example:
header_data = {100, 100, 100};
body_data = {1234, 100, 4321};
trailer_data = {1001, 1001, 1001};
data = struct('header', header_data, 'body', body_data, 'trailer', trailer_data);
yields a 1x3 struct array.
This data is then read in Python as follows:
import scipy.io as sio
import numpy as np
matlab_data = sio.loadmat('data.mat', squeeze_me=True)
data = matlab_data['data']
This makes data a 1-dimensional numpy.ndarray of size 3 with dtype=dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')]), which I can happily iterate through using numpy.nditer to extract and parse the data from each struct.
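As a rough illustration of that flat case, here is a minimal sketch that builds a stand-in array by hand instead of reading data.mat, so the layout is an assumption about what loadmat returns with squeeze_me=True. The parse_record helper is hypothetical and only stands in for the real parsing function.

```python
import numpy as np

# Stand-in for sio.loadmat('data.mat', squeeze_me=True)['data'] (assumed layout).
struct_dtype = np.dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')])
data = np.empty(3, dtype=struct_dtype)
data['header'] = [100, 100, 100]
data['body'] = [1234, 100, 4321]
data['trailer'] = [1001, 1001, 1001]

def parse_record(rec):
    """Hypothetical parser: turn one struct element into a plain dict."""
    return {name: rec[name] for name in rec.dtype.names}

# Plain iteration yields one numpy.void record per struct element.
parsed = [parse_record(rec) for rec in data]
print(parsed[0])
```

Plain iteration is used here rather than numpy.nditer only to keep the sketch short; the field access by name is the same either way.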
The problem I'm trying to overcome is that unfortunately (and out of my control) in some of the files I need to parse, the struct arrays defined above are themselves members of another struct array with a single field, msg. Continuing with my example in Matlab:
messages = struct('msg', {data(1), data(2), data(3)});
When this is loaded with scipy.io.loadmat in Python, it results in a 1-dimensional numpy.ndarray of size 3 with dtype=dtype([('msg', 'O')]). In order to reuse the same function for parsing the data fields, I'd need logic to detect the msg field, if it exists, and then extract each numpy.void from there before calling the function that parses the individual header, body and trailer fields.
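The detection logic described above might look roughly like the following sketch. The unwrap_msg name is hypothetical, and the sample array is built by hand under the assumption that each msg cell holds a one-element struct array; np.atleast_1d is there so that a numpy.void or 0-d element would be handled the same way.

```python
import numpy as np

inner = np.dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')])
outer = np.dtype([('msg', 'O')])

def unwrap_msg(arr):
    """Return the inner struct array if arr only wraps it in a 'msg' field."""
    if arr.dtype.names != ('msg',):
        return arr  # already the flat header/body/trailer layout
    # Normalise every wrapped element to a 1-d struct array, then join them.
    parts = [np.atleast_1d(m).ravel() for m in arr['msg'].ravel()]
    return np.concatenate(parts)

# Hand-built stand-in for the loaded 'messages' variable (assumed layout).
messages = np.empty(3, dtype=outer)
rows = [(100, 1234, 1001), (100, 100, 1001), (100, 4321, 1001)]
for i, vals in enumerate(rows):
    rec = np.empty(1, dtype=inner)
    rec[0] = vals
    messages['msg'][i] = rec

flat = unwrap_msg(messages)
print(flat.dtype.names, flat.shape)
```

Calling unwrap_msg on an array that has no msg field returns it unchanged, so the same parsing function could be fed from either file layout.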
In Matlab, this is easily overcome because the original 1x3 struct array with three fields can be extracted from the 1x3 struct array with the single msg field by doing: [messages.msg], which yields a 1x3 struct array with the header, body and trailer fields. If I try to translate this to numpy, the following command gives me a view of the original numpy.ndarray, which is not a structure (dtype=dtype('O')).
I'm trying to figure out if there is an analogous way with numpy to recover the struct array with three fields from the one with the single msg field, as I can do in Matlab, or if I truly need to iterate over each value and manually extract it from the msg field before using a common parsing function. Again, the format of the Matlab input files is out of my control and I cannot change them, and my example here is trivial compared to the number of nested fields I need to extract from the Matlab data.
Upvotes: 0
Views: 656
Reputation: 231335
Trying to recreate your file with Octave (saved with -v7), I get, in an IPython session:
In [190]: data = io.loadmat('test.mat')
In [191]: data
Out[191]:
{'__globals__': [],
'__header__': b'MATLAB 5.0 MAT-file, written by Octave 4.0.0, 2016-10-04 20:54:53 UTC',
'__version__': '1.0',
'body_data': array([[array([[ 1234.]]), array([[ 100.]]), array([[ 4321.]])]], dtype=object),
'data': array([[([[100.0]], [[1234.0]], [[1001.0]]),
([[100.0]], [[100.0]], [[1001.0]]),
([[100.0]], [[4321.0]], [[1001.0]])]],
dtype=[('header', 'O'), ('body', 'O'), ('trailer', 'O')]),
'header_data': array([[array([[ 100.]]), array([[ 100.]]), array([[ 100.]])]], dtype=object),
'messages': array([[([[(array([[ 100.]]), array([[ 1234.]]), array([[ 1001.]]))]],),
([[(array([[ 100.]]), array([[ 100.]]), array([[ 1001.]]))]],),
([[(array([[ 100.]]), array([[ 4321.]]), array([[ 1001.]]))]],)]],
dtype=[('msg', 'O')]),
'trailer_data': array([[array([[ 1001.]]), array([[ 1001.]]), array([[ 1001.]])]], dtype=object)}
body_data, header_data and trailer_data are Octave cells, which in numpy are 2d object arrays containing 2d elements:
In [194]: data['trailer_data'][0,0]
Out[194]: array([[ 1001.]])
In [195]: data['trailer_data'][0,0][0,0]
Out[195]: 1001.0
data is a (1,3) structured array with 3 fields:
In [198]: data['data']['header'][0,0][0,0]
Out[198]: 100.0
messages is (1,3) with 1 field, with further nesting as with data:
In [208]: data['messages']['msg'][0,0]['header'][0,0][0,0]
Out[208]: 100.0
(This may be a repetition of what you describe, but I just want to be clear about the data structure.)
================
Playing around, I found that I can strip out the (1,3) shape of msg with indexing and concatenate:
In [241]: np.concatenate(data['messages']['msg'][0])
Out[241]:
array([[([[100.0]], [[1234.0]], [[1001.0]])],
[([[100.0]], [[100.0]], [[1001.0]])],
[([[100.0]], [[4321.0]], [[1001.0]])]],
dtype=[('header', 'O'), ('body', 'O'), ('trailer', 'O')])
In [242]: data['data']
Out[242]:
array([[([[100.0]], [[1234.0]], [[1001.0]]),
([[100.0]], [[100.0]], [[1001.0]]),
([[100.0]], [[4321.0]], [[1001.0]])]],
dtype=[('header', 'O'), ('body', 'O'), ('trailer', 'O')])
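This has the same fields and values as data, although the printed result has one record per row, i.e. shape (3,1) rather than (1,3).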
For some reason I have to reduce it to a (3,) array before the concatenate does what I want. I haven't wrapped my mind around those details.
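One possible explanation, offered as a sketch rather than a definitive account: np.concatenate treats its argument as a sequence of arrays. A (1,3) object array is then a sequence of one (3,) object array, whereas indexing with [0] gives a sequence of three (1,1) struct arrays. The cells array below is hand-built to mimic data['messages']['msg'], so its layout is an assumption.

```python
import numpy as np

inner = np.dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')])

# Hand-built stand-in for data['messages']['msg']: a (1, 3) object array
# whose cells each hold a (1, 1) struct array.
cells = np.empty((1, 3), dtype=object)
rows = [(100.0, 1234.0, 1001.0), (100.0, 100.0, 1001.0), (100.0, 4321.0, 1001.0)]
for i, vals in enumerate(rows):
    rec = np.empty((1, 1), dtype=inner)
    rec[0, 0] = vals
    cells[0, i] = rec

# Sequence of ONE (3,) object array: nothing is unwrapped.
whole = np.concatenate(cells)
# Sequence of THREE (1, 1) struct arrays: joined along axis 0.
joined = np.concatenate(cells[0])

print(whole.shape, whole.dtype)
print(joined.shape, joined.dtype.names)
```

If a 1-dimensional result is preferred, joined.ravel() should flatten it to shape (3,).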
Upvotes: 2