Reputation: 18859
I have a struct array created in MATLAB and saved to a MAT-file in v7.3 format:
struArray = struct('name', {'one', 'two', 'three'}, ...
                   'id', {1, 2, 3}, ...
                   'data', {[1:10], [3:9], [0]})
save('test.mat', 'struArray', '-v7.3')
Now I want to read this file in Python using h5py:
import h5py
data = h5py.File('test.mat', 'r')
struArray = data['/struArray']
I have no idea how to get the struct data one by one from struArray:
for index in range(<the size of struArray>):
    elem = <the index-th struct in struArray>
    name = <the name of elem>
    id = <the id of elem>
    data = <the data of elem>
Upvotes: 21
Views: 37582
Reputation: 681
I used the mat73 package (see the mat73 GitHub repository). It can be installed via pip and takes care of loading v7.3 .mat files, much as scipy.io.loadmat does for older formats.
import mat73
data_dict = mat73.loadmat('data.mat', use_attrdict=True)
This returns a dict that mirrors the structure of the .mat file.
Upvotes: 0
Reputation: 141
I know of two solutions (one of which I wrote, and which works better if the *.mat file is very large or very deep) that abstract away direct interaction with the h5py library.
The first is the hdf5storage package, which is well maintained and meant to help load v7.3 MAT-files into Python.
The second is my own loader, which I wrote because version 0.2.0 of hdf5storage has trouble loading large (~500 MB) and/or deep arrays (I'm actually not sure which of the two causes the issue).
Assuming you've downloaded both packages into a place where you can load them into Python, you can see that they produce similar outputs for your example 'test.mat':
In [1]: pyInMine = LoadMatFile('test.mat')
In [2]: pyInHdf5 = hdf5.loadmat('test.mat')
In [3]: pyInMine.keys()
Out[3]: dict_keys(['struArray'])
In [4]: pyInMine['struArray'].keys()
Out[4]: dict_keys(['data', 'id', 'name'])
In [5]: pyInHdf5.keys()
Out[5]: dict_keys(['struArray'])
In [6]: pyInHdf5['struArray'].dtype
Out[6]: dtype([('name', 'O'), ('id', '<f8', (1, 1)), ('data', 'O')])
In [7]: pyInHdf5['struArray']['data']
Out[7]:
array([[array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.]]),
array([[3., 4., 5., 6., 7., 8., 9.]]), array([[0.]])]],
dtype=object)
In [8]: pyInMine['struArray']['data']
Out[8]:
array([[array([[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.]]),
array([[3., 4., 5., 6., 7., 8., 9.]]), array([[0.]])]],
dtype=object)
The big difference is that my library converts MATLAB structure arrays into Python dictionaries whose keys are the structure's fields, whereas hdf5storage converts them into numpy object arrays with various dtypes storing the fields.
I also note that the indexing behavior differs from what you would expect coming from MATLAB. Specifically, in MATLAB, to get the name field of the second structure, you index the structure first:
[Matlab] >> struArray(2).name
[Matlab] >> 'two'
In my package, you have to first grab the field and then index:
In [9]: pyInMine['struArray'].shape
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-64-a2f85945642b> in <module>
----> 1 pyInMine['struArray'].shape
AttributeError: 'dict' object has no attribute 'shape'
In [10]: pyInMine['struArray']['name'].shape
Out[10]: (1, 3)
In [11]: pyInMine['struArray']['name'][0,1]
Out[11]: 'two'
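If you prefer the MATLAB-style "index the element, then take the field" access, a field-first dictionary can be transposed into one record per element. The following is only a sketch: to_records is a hypothetical helper (it is not part of either package), and plain nested lists stand in for the (1, 3) arrays shown above, so with real arrays you would index [0, i] instead of [0][i].

```python
def to_records(struct_fields):
    """Turn {'field': [[v0, v1, ...]]} into [{'field': v0}, {'field': v1}, ...]."""
    names = list(struct_fields)
    # Every field is assumed to hold one row with the same number of elements.
    n = len(struct_fields[names[0]][0])
    return [{name: struct_fields[name][0][i] for name in names} for i in range(n)]


# Hypothetical stand-in for the loaded struct array (nested lists, not arrays).
stru = {
    "name": [["one", "two", "three"]],
    "id": [[1.0, 2.0, 3.0]],
    "data": [[[1.0, 2.0], [3.0, 4.0], [0.0]]],
}
records = to_records(stru)
print(records[1]["name"])  # two
```

The trade-off is memory and time: the records are built eagerly, which is fine for small structs but may not be what you want for very large files.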
The hdf5storage package is a little bit nicer and lets you either index the structure and then grab the field, or vice versa, because of how structured numpy object arrays work:
In [12]: pyInHdf5['struArray'].shape
Out[12]: (1, 3)
In [13]: pyInHdf5['struArray'][0,1]['name']
Out[13]: array([['two']], dtype='<U3')
In [14]: pyInHdf5['struArray']['name'].shape
Out[14]: (1, 3)
In [15]: pyInHdf5['struArray']['name'][0,1]
Out[15]: array([['two']], dtype='<U3')
Again, the two packages treat the final output a little differently, but in general both are quite good at reading v7.3 MAT-files. One final thought: for files of ~500 MB or more, I've found that the hdf5storage package hangs while loading, while my package does not (though it still takes ~1.5 minutes to complete the load).
Upvotes: 0
Reputation: 1178
The MATLAB v7.3 file format is not particularly easy to work with in h5py. It relies on HDF5 references; cf. the h5py documentation on references.
>>> import h5py
>>> f = h5py.File('test.mat', 'r')
>>> list(f.keys())
['#refs#', 'struArray']
>>> struArray = f['struArray']
>>> struArray['name'][0, 0] # this is the HDF5 reference
<HDF5 object reference>
>>> f[struArray['name'][0, 0]][()] # this is the actual data (.value was removed in h5py 3.0)
array([[111],
[110],
[101]], dtype=uint16)
To read struArray(i).id:
>>> f[struArray['id'][0, 0]][0, 0]
1.0
>>> f[struArray['id'][1, 0]][0, 0]
2.0
>>> f[struArray['id'][2, 0]][0, 0]
3.0
Notice that MATLAB stores a number as an array of size (1, 1), hence the final [0, 0] to get the number.
To read struArray(i).data:
>>> f[struArray['data'][0, 0]][()]
array([[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ 5.],
[ 6.],
[ 7.],
[ 8.],
[ 9.],
[ 10.]])
To read struArray(i).name, it is necessary to convert the array of integers to a string:
>>> f[struArray['name'][0, 0]][()].tobytes()[::2].decode()
'one'
>>> f[struArray['name'][1, 0]][()].tobytes()[::2].decode()
'two'
>>> f[struArray['name'][2, 0]][()].tobytes()[::2].decode()
'three'
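Putting the three lookups together, the loop asked for in the question could be sketched roughly as follows. This is not a general MAT-file reader: read_struct_array is a hypothetical helper, it assumes every field is an (N, 1) dataset of object references as in the session above (a 1x1 struct is laid out differently), and it uses a heuristic that treats any uint16 dataset as MATLAB char data. It decodes the characters as UTF-16 rather than with the [::2] slice, since the slice only works for ASCII text.

```python
import h5py
import numpy as np


def read_struct_array(path, var="struArray"):
    """Return one dict per struct element, following the HDF5 references."""
    records = []
    with h5py.File(path, "r") as f:
        group = f[var]
        fields = list(group.keys())
        # Assumption: each field holds an (N, 1) array of object references.
        n = group[fields[0]].shape[0]
        for i in range(n):
            record = {}
            for field in fields:
                value = f[group[field][i, 0]][()]  # dereference, then read
                if value.dtype == np.uint16:
                    # Heuristic: uint16 is taken to be char data (UTF-16 code units).
                    record[field] = value.astype("<u2").tobytes().decode("utf-16-le")
                else:
                    # The stored array is the transpose of the MATLAB array.
                    record[field] = value.T
            records.append(record)
    return records
```

With the file from the question, iterating over read_struct_array('test.mat') should then give one dictionary per element, with elem['name'] as a str and elem['id'][0, 0] as the scalar id. A genuinely numeric uint16 field would be misread as text by this heuristic, so adjust the check if your data contains one.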
Upvotes: 19
Reputation: 4622
It's really a problem with MATLAB 7.3 and h5py. My trick is to convert the h5py._hl.dataset.Dataset type to a numpy array. For example, np.array(data['data']) will solve your problem with the 'data' field.
Upvotes: -1
Reputation: 231385
visit or visititems is a quick way of seeing the overall structure of an h5py file (fs here is the open h5py.File):
fs['struArray'].visititems(lambda n,o:print(n, o))
When I run this on a file produced by Octave's save -hdf5, I get:
type <HDF5 dataset "type": shape (), type "|S7">
value <HDF5 group "/struArray/value" (3 members)>
value/data <HDF5 group "/struArray/value/data" (2 members)>
value/data/type <HDF5 dataset "type": shape (), type "|S5">
value/data/value <HDF5 group "/struArray/value/data/value" (4 members)>
value/data/value/_0 <HDF5 group "/struArray/value/data/value/_0" (2 members)>
value/data/value/_0/type <HDF5 dataset "type": shape (), type "|S7">
value/data/value/_0/value <HDF5 dataset "value": shape (10, 1), type "<f8">
value/data/value/_1 <HDF5 group "/struArray/value/data/value/_1" (2 members)>
...
value/data/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">
value/id <HDF5 group "/struArray/value/id" (2 members)>
value/id/type <HDF5 dataset "type": shape (), type "|S5">
value/id/value <HDF5 group "/struArray/value/id/value" (4 members)>
value/id/value/_0 <HDF5 group "/struArray/value/id/value/_0" (2 members)>
...
value/id/value/_2/value <HDF5 dataset "value": shape (), type "<f8">
value/id/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">
value/name <HDF5 group "/struArray/value/name" (2 members)>
...
value/name/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">
This may not be the same as what MATLAB 7.3 produces, but it gives an idea of a structure's complexity.
A more refined callback can display values, and could be the starting point for recreating a Python object (dictionary, lists, etc).
def callback(name, obj):
    if name.endswith('type'):
        print('type:', obj[()])
    elif name.endswith('value'):
        if isinstance(obj, h5py.Dataset):
            print(obj[()].T) # transposed; see http://stackoverflow.com/questions/21624653
    elif name.endswith('dims'):
        print('dims:', obj[()])
    else:
        print('name:', name)
fs.visititems(callback)
produces:
name: struArray
type: b'struct'
name: struArray/value/data
type: b'cell'
name: struArray/value/data/value/_0
type: b'matrix'
[[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]]
name: struArray/value/data/value/_1
type: b'matrix'
[[ 3. 4. 5. 6. 7. 8. 9.]]
name: struArray/value/data/value/_2
type: b'scalar'
0.0
dims: [3 1]
name: struArray/value/id
type: b'cell'
name: struArray/value/id/value/_0
type: b'scalar'
1.0
...
dims: [3 1]
name: struArray/value/name
type: b'cell'
name: struArray/value/name/value/_0
type: b'sq_string'
[[111 110 101]]
...
dims: [3 1]
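To go from printing to actually recreating a Python object, the same visititems mechanism can fill a nested dictionary keyed by path components. This is a minimal sketch: collect_tree is a hypothetical helper name, it copies every dataset into memory, and it makes no attempt to interpret the Octave type/value/dims convention shown above.

```python
import h5py


def collect_tree(h5obj):
    """Copy an h5py file or group into nested dicts; datasets become their values."""
    tree = {}

    def callback(name, obj):
        # name is the path relative to h5obj, e.g. 'struArray/value/id'.
        *parents, leaf = name.split("/")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        if isinstance(obj, h5py.Dataset):
            node[leaf] = obj[()]  # read the whole dataset
        else:
            node.setdefault(leaf, {})  # a group becomes a nested dict
        # Returning None tells visititems to keep walking.

    h5obj.visititems(callback)
    return tree
```

Passing the open file (fs in the session above) would return a dictionary whose nesting mirrors the listing printed by visititems, which can then be post-processed field by field.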
Upvotes: 4
Reputation: 4207
I would start by firing up the interpreter and running help on struArray. It should give you enough information to get you started. Failing that, you can dump the attributes of any Python object by printing its __dict__ attribute.
Upvotes: 0
Reputation: 10967
I'm sorry, but I think it will be quite challenging to get the contents of cells/structures from outside MATLAB. If you view the produced files (e.g. with HDFView) you will see there are lots of cross-references and no obvious way to proceed.
If you stick to simple numerical arrays it works fine. If you have small cell arrays containing numerical arrays, you can convert them to separate variables (i.e. cellcontents1, cellcontents2, etc.), which usually takes just a few lines and allows them to be saved and loaded directly. So in your example I would save a file with the variables name1, name2, name3, id1, id2, id3, etc.
EDIT: You specified h5py in the question, so that's what I answered, but it is worth mentioning that with scipy.io.loadmat you should be able to get the original variables converted to numpy equivalents (e.g. object arrays), provided the file is saved in a pre-v7.3 format, since scipy.io.loadmat does not read v7.3 files.
Upvotes: 0