TWReever

Reputation: 357

Nested structured array field access with numpy

I am working on parsing Matlab structured arrays in Python. For simplicity, the data structure ultimately consists of 3 fields, say header, body, trailer. Creating some data in Matlab for example:

header_data = {100, 100, 100};
body_data = {1234, 100, 4321};
trailer_data = {1001, 1001, 1001};
data = struct('header', header_data, 'body', body_data, 'trailer', trailer_data);

yields a 1x3 struct array.

This data is then read in Python as follows:

import scipy.io as sio
import numpy as np

matlab_data = sio.loadmat('data.mat', squeeze_me=True)
data = matlab_data['data']

This makes data a 1-dimensional numpy.ndarray of size 3 with dtype=dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')]), which I can happily iterate through using numpy.nditer and extract and parse the data from each struct.
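For reference, a minimal sketch of that per-record parsing. It does not read a .mat file: the array is built by hand with the dtype described above, and the field values are assumed to be plain numbers, as squeeze_me=True would typically leave them. parse_record is a hypothetical stand-in for the real parsing function.

```python
import numpy as np

# Stand-in for the squeezed loadmat result: a 1-D structured array whose
# three fields all have object dtype. Values mirror the Matlab example.
dt = np.dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')])
data = np.empty(3, dtype=dt)
data[0] = (100, 1234, 1001)
data[1] = (100, 100, 1001)
data[2] = (100, 4321, 1001)


def parse_record(rec):
    """Hypothetical parser: turn one numpy.void record into a plain dict."""
    return {name: rec[name] for name in rec.dtype.names}


# Iterating a 1-D structured array yields one numpy.void per struct.
parsed = [parse_record(rec) for rec in data]

# Indexing by field name returns that field for every record at once.
bodies = list(data['body'])
```

A plain for loop is used here instead of numpy.nditer, since it hands back each record directly.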

The problem I'm trying to overcome is that unfortunately (and out of my control) in some of the files I need to parse, the above defined struct arrays are themselves a member of another struct array with a field msg. Continuing with my example in Matlab:

messages = struct('msg', {data(1), data(2), data(3)});

When this is loaded with scipy.io.loadmat in Python, it results in a 1-dimensional numpy.ndarray of size 3 with dtype=dtype([('msg', 'O')]). In order to reuse the same function for parsing the data fields, I'd need to have logic to detect the msg field, if it exists, and then extract each numpy.void from there before calling the function to parse the individual header, body and trailer fields.

In Matlab, this is easily overcome because the original 1x3 struct array with three fields can be extracted from the 1x3 struct array with the single msg field by doing: [messages.msg], which yields a 1x3 struct array with the header, body and trailer fields. If I try to translate this to numpy, the following command gives me a view of the original numpy.ndarray, which is not a structure (dtype=dtype('O')).

I'm trying to figure out whether there is an analogous way in numpy to recover the struct array with three fields from the one with the single msg field, as I can do in Matlab, or whether I truly need to iterate over each value and manually extract it from the msg field before using a common parsing function. Again, the format of the Matlab input files is out of my control and I cannot change them, and my example here is trivial compared to the number of nested fields I need to extract from the Matlab data.

Upvotes: 0

Views: 656

Answers (1)

hpaulj

Reputation: 231335

Trying to recreate your file with Octave (saved with -v7), I get, in an IPython session:

In [190]: data = io.loadmat('test.mat')
In [191]: data
Out[191]: 
{'__globals__': [],
 '__header__': b'MATLAB 5.0 MAT-file, written by Octave 4.0.0, 2016-10-04 20:54:53 UTC',
 '__version__': '1.0',
 'body_data': array([[array([[ 1234.]]), array([[ 100.]]), array([[ 4321.]])]], dtype=object),
 'data': array([[([[100.0]], [[1234.0]], [[1001.0]]),
         ([[100.0]], [[100.0]], [[1001.0]]),
         ([[100.0]], [[4321.0]], [[1001.0]])]], 
       dtype=[('header', 'O'), ('body', 'O'), ('trailer', 'O')]),
 'header_data': array([[array([[ 100.]]), array([[ 100.]]), array([[ 100.]])]], dtype=object),
 'messages': array([[([[(array([[ 100.]]), array([[ 1234.]]), array([[ 1001.]]))]],),
         ([[(array([[ 100.]]), array([[ 100.]]), array([[ 1001.]]))]],),
         ([[(array([[ 100.]]), array([[ 4321.]]), array([[ 1001.]]))]],)]], 
       dtype=[('msg', 'O')]),
 'trailer_data': array([[array([[ 1001.]]), array([[ 1001.]]), array([[ 1001.]])]], dtype=object)}

body_data, header_data and trailer_data are Octave cells, which in numpy are 2d object arrays containing 2d elements:

In [194]: data['trailer_data'][0,0]
Out[194]: array([[ 1001.]])
In [195]: data['trailer_data'][0,0][0,0]
Out[195]: 1001.0

data is a (1,3) structured array with 3 fields:

In [198]: data['data']['header'][0,0][0,0]
Out[198]: 100.0

messages is (1,3) with 1 field, with further nesting as with data.

In [208]: data['messages']['msg'][0,0]['header'][0,0][0,0]
Out[208]: 100.0

(This may be a repetition of what you describe, but I just want to be clear about the data structure.)

================

Playing around, I found that I can strip out the (1,3) shape of msg with indexing and concatenate:

In [241]: np.concatenate(data['messages']['msg'][0])
Out[241]: 
array([[([[100.0]], [[1234.0]], [[1001.0]])],
       [([[100.0]], [[100.0]], [[1001.0]])],
       [([[100.0]], [[4321.0]], [[1001.0]])]], 
      dtype=[('header', 'O'), ('body', 'O'), ('trailer', 'O')])
In [242]: data['data']
Out[242]: 
array([[([[100.0]], [[1234.0]], [[1001.0]]),
        ([[100.0]], [[100.0]], [[1001.0]]),
        ([[100.0]], [[4321.0]], [[1001.0]])]], 
      dtype=[('header', 'O'), ('body', 'O'), ('trailer', 'O')])

This looks the same as data, apart from the shape: (3,1) here versus (1,3) for data.
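To reuse one parsing function for both layouts, the unwrapping could go into a small helper. The following is only a sketch under stated assumptions: unwrap_msg is a hypothetical name, each msg entry is assumed to hold exactly one record, and the test arrays are built by hand with plain floats in the fields rather than loaded from a file.

```python
import numpy as np

inner_dt = np.dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')])


def unwrap_msg(arr, field='msg'):
    """Hypothetical helper: return the inner struct array if `arr` only
    wraps it in a single object field, otherwise return `arr` unchanged."""
    names = arr.dtype.names
    if names is None or field not in names:
        return arr
    # Each entry is assumed to hold exactly one record (for example a (1,1)
    # structured array); flatten each to shape (1,) so they can be joined.
    parts = [np.atleast_1d(m).ravel() for m in arr[field].ravel()]
    return np.concatenate(parts).reshape(arr.shape)


# Hand-built stand-ins for the two layouts (values from the example).
rows = [(100.0, 1234.0, 1001.0), (100.0, 100.0, 1001.0), (100.0, 4321.0, 1001.0)]
data = np.empty((1, 3), dtype=inner_dt)
messages = np.empty((1, 3), dtype=[('msg', 'O')])
for i, row in enumerate(rows):
    data[0, i] = row
    rec = np.empty((1, 1), dtype=inner_dt)
    rec[0, 0] = row
    messages['msg'][0, i] = rec   # store the (1,1) record as one object

flat = unwrap_msg(messages)       # same shape and dtype as data
same = unwrap_msg(data)           # already unwrapped, returned as-is
```

The final reshape restores the shape of the outer array, so the result can be passed to the same code that handles data.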

For some reason I have to reduce it to a (3,) array before the concatenate does what I want. I haven't wrapped my mind around those details.
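One likely explanation is that np.concatenate treats its argument as a sequence and iterates over the first axis. A (1,3) object array is then a sequence of one (3,) object array, while a (3,) object array is a sequence of the three (1,1) structured arrays it holds. A small sketch, using hand-built arrays with plain floats in the fields instead of the loaded file:

```python
import numpy as np

inner_dt = np.dtype([('header', 'O'), ('body', 'O'), ('trailer', 'O')])


def one_record(header, body, trailer):
    """Build a (1,1) structured array, like one unsqueezed Matlab struct."""
    rec = np.empty((1, 1), dtype=inner_dt)
    rec[0, 0] = (header, body, trailer)
    return rec


messages = np.empty((1, 3), dtype=[('msg', 'O')])
msg = messages['msg']                     # object view, shape (1, 3)
msg[0, 0] = one_record(100.0, 1234.0, 1001.0)
msg[0, 1] = one_record(100.0, 100.0, 1001.0)
msg[0, 2] = one_record(100.0, 4321.0, 1001.0)

# Sequence of ONE item, a (3,) object array: nothing gets unwrapped.
whole = np.concatenate(msg)

# Sequence of THREE (1,1) structured arrays: joined along axis 0.
row = np.concatenate(msg[0])

print(whole.shape, whole.dtype)
print(row.shape, row.dtype.names)
```

In the first call the result is still an object array, so the inner dtype never surfaces. In the second, the three records are stacked into one structured array.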

Upvotes: 2
