Explorer

Reputation: 33

Read a matlab .mat file using h5py

I want to use the Python 3 package h5py to read a MATLAB .mat file saved in version 7.3 format.

The file contains a MATLAB variable named results.

That variable is a 1x1 cell holding a struct, and the data I need is a field of that struct.

In MATLAB, I can get the data with the following code:

load('.mat PATH');
results{1}.res

How should I read this data with h5py? An example .mat file can be obtained from here

Upvotes: 3

Views: 3888

Answers (3)

esg

Reputation: 141

As I mentioned in my other post on the hdf5storage package, I found it far too slow at loading deeply nested arrays. So I implemented my own matfile loader, whose compact code might also be useful if you care about the specifics of how reading a v7.3 matfile into Python works. (That said, the code currently has very few comments, so it may not be that useful.)

For the case of my library, the outputs are very similar to hdf5storage, as shown here.

In [0]: from MatFileMethods import LoadMatFile
In [1]: pyIn = LoadMatFile('/Users/emilio/Downloads/Basketball_ECO_HC.mat')
In [2]: type(pyIn)
Out[2]: dict
In [3]: pyIn.keys()
Out[3]: dict_keys(['results'])
In [4]: type(pyIn['results'])
Out[4]: numpy.ndarray
In [5]: pyIn['results'].shape
Out[5]: (1, 1)

Note that, as with the hdf5storage package, the 1x1 cell that you index in Matlab with results{1} becomes a two-dimensional numpy.ndarray that you index with pyIn['results'][0,0], as below.

In [6]: type(pyIn['results'][0,0])
Out[6]: dict
In [7]: pyIn['results'][0,0].keys()
Out[7]: dict_keys(['annoBegin', 'fps', 'fps_no_ftr', 'len', 'res', 'startFrame', 'type'])
In [8]: pyIn['results'][0,0]['res'].shape
Out[8]: (725, 4)
In [9]: pyIn['results'][0,0]['res'][0,:]
Out[9]: array([198., 214.,  34.,  81.])

In contrast with hdf5storage, I opt to make Matlab structures into Python dicts, so that the fields of the structures are the keys of the dictionaries.
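To make the struct-to-dict idea concrete, here is a minimal sketch of how such a conversion could look with plain h5py. It is not the code of the loader described above; the names to_python and load_v73 are made up for illustration, the stand-in file is built in memory rather than read from the real .mat file, and only numeric arrays, structs, and cells of references are handled.

```python
import h5py
import numpy as np


def to_python(f, obj):
    """Recursively convert an h5py object into dicts and NumPy arrays."""
    if isinstance(obj, h5py.Group):
        # MATLAB struct -> dict keyed by field name
        return {name: to_python(f, child) for name, child in obj.items()}
    data = obj[()]
    if data.dtype.kind == "O":
        # Assumed to be a MATLAB cell: every entry is an HDF5 object reference
        out = np.empty(data.shape, dtype=object)
        for idx in np.ndindex(data.shape):
            out[idx] = to_python(f, f[data[idx]])
        return out.T
    # Numeric array: reverse the axes to get MATLAB's orientation back
    return data.T


def load_v73(f):
    """Load every top-level variable, skipping MATLAB's '#refs#' group."""
    return {name: to_python(f, obj) for name, obj in f.items() if name != "#refs#"}


# In-memory stand-in that mimics the layout of the example file (made-up values)
with h5py.File("stand_in.mat", "w", driver="core", backing_store=False) as f:
    struct = f.create_group("#refs#/b")
    struct["res"] = np.arange(4 * 725, dtype="<f8").reshape(4, 725)
    struct["fps"] = np.array([[22.36617883]])
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    cell = f.create_dataset("results", (1, 1), dtype=ref_dtype)
    cell[0, 0] = struct.ref
    py_in = load_v73(f)

print(py_in["results"][0, 0]["res"].shape)
```

Character arrays, logicals, and empty values need extra handling that this sketch leaves out.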

In any case, this module is by no means fully tested, but it has served me well for loading mat files of ~500 MB and larger that version 0.2 of hdf5storage doesn't seem to handle: my loader took ~1.5 minutes, whereas hdf5storage had not finished after 10 minutes. (The 1.5 minutes is still far slower than Matlab's own <15 s load time, so there's room for improvement.)

Upvotes: 0

esg

Reputation: 141

If your question is asking generally how to read matfiles saved using v7.3 in Python, the hdf5storage package provides some utilities that might work for you. In the case of your file (after installing the package) you would run

In [0]: import hdf5storage as hdf5
In [1]: pyIn = hdf5.loadmat('Basketball_ECO_HC.mat')
In [2]: type(pyIn)
Out[2]: dict
In [3]: pyIn.keys()
Out[3]: dict_keys(['results'])
In [4]: type(pyIn['results'])
Out[4]: numpy.ndarray
In [5]: pyIn['results'].shape
Out[5]: (1, 1)
In [6]: pyIn['results'].dtype
Out[6]: dtype('O')
In [7]: pyIn['results'][0,0].dtype
Out[7]: dtype([('type', '<U4', (1, 1)), ('res', '<f8', (725, 4)), ('fps', '<f8', (1, 1)), ('fps_no_ftr', '<f8', (1, 1)), ('len', '<f8', (1, 1)), ('annoBegin', '<f8', (1, 1)), ('startFrame', '<f8', (1, 1))])

You can see it does a good job of parsing the input array, though it turns the 1x1 cell that you would index in Matlab with results{1} into a 2D numpy array that you index with pyIn['results'][0,0] instead. Another odd thing I ran into with this data is the addition of a dimension in the deeper structure fields, as below:

In [8]: pyIn['results'][0,0]['res'].shape
Out[8]: (1, 725, 4)
In [9]: pyIn['results'][0,0]['res'][0,0,:]
Out[9]: array([198., 214.,  34.,  81.])

Not entirely sure why this happens, but in general it should work well.
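One possible explanation, offered as a guess rather than a statement about hdf5storage internals: if the element returned by pyIn['results'][0,0] is itself a length-1 NumPy structured array whose res field has subarray shape (725, 4), then indexing by field name prepends the array's own shape. The sketch below reproduces that behaviour with plain NumPy; the array rec is a hypothetical stand-in, not the loaded data.

```python
import numpy as np

# Field layout copied from the dtype shown above (two fields are enough here)
dt = np.dtype([("res", "<f8", (725, 4)), ("fps", "<f8", (1, 1))])

# Hypothetical stand-in: a structured array with a single record
rec = np.zeros((1,), dtype=dt)

# Field access on the whole array keeps the array's shape (1,) in front
whole_shape = rec["res"].shape

# Picking the single record first removes the extra dimension
record_shape = rec[0]["res"].shape

print(whole_shape, record_shape)
```

If that guess holds for the real data, pyIn['results'][0,0][0]['res'] or pyIn['results'][0,0]['res'][0] would give the 725x4 array directly.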

That said, I did run into an issue with the latest version (0.2) of this package where for really deep array/cell/structure combos it became incredibly slow. The nice thing is that this package is still being maintained, so fixes for this might be in the pipeline. Nevertheless, this prompted me to write my own h5py reader for matfiles which is faster in these cases, and I'll discuss it as another answer.

Upvotes: 0

hpaulj

Reputation: 231385

While h5py can read h5 files from MATLAB, figuring out what is there takes some exploring: looking at keys, groups, and datasets (and possibly attrs). There's nothing in scipy that will help you (scipy.io.loadmat only handles the older, pre-7.3 MATLAB mat formats).

With the downloaded file:

In [61]: f = h5py.File('Downloads/Basketball_ECO_HC.mat','r')
In [62]: f
Out[62]: <HDF5 file "Basketball_ECO_HC.mat" (mode r)>
In [63]: f.keys()
Out[63]: <KeysViewHDF5 ['#refs#', 'results']>
In [65]: f['results']
Out[65]: <HDF5 dataset "results": shape (1, 1), type "|O">
In [66]: arr = f['results'][:]
In [67]: arr
Out[67]: array([[<HDF5 object reference>]], dtype=object)
In [68]: arr.item()
Out[68]: <HDF5 object reference>

I'd have to check the h5py docs to see if I can check that object reference further. I'm not familiar with it.
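For what it's worth, h5py dereferences an object reference when you index the file (or any group) with it. The sketch below shows this on a small in-memory stand-in that mimics the layout of the example file; the stand-in and its zero-filled res dataset are assumptions for illustration, not the real data.

```python
import h5py
import numpy as np

# In-memory stand-in: a struct stored under '#refs#' and a 1x1 cell pointing to it
with h5py.File("stand_in.mat", "w", driver="core", backing_store=False) as f:
    struct = f.create_group("#refs#/b")
    struct["res"] = np.zeros((4, 725))
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    cell = f.create_dataset("results", (1, 1), dtype=ref_dtype)
    cell[0, 0] = struct.ref

    ref = f["results"][0, 0]   # the same kind of object that arr.item() returns
    target = f[ref]            # indexing the file with a reference follows it
    target_name = target.name
    res_shape = target["res"].shape

print(target_name, res_shape)
```

On the real file, f[f['results'][0, 0]] should therefore give the same group that is reached through '#refs#' by hand.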

But exploring the other key:

In [69]: list(f.keys())[0]
Out[69]: '#refs#'
In [70]: f[list(f.keys())[0]]
Out[70]: <HDF5 group "/#refs#" (2 members)>
In [71]: f[list(f.keys())[0]].keys()
Out[71]: <KeysViewHDF5 ['a', 'b']>
In [72]: f[list(f.keys())[0]]['a']
Out[72]: <HDF5 dataset "a": shape (2,), type "<u8">
In [73]: _[:]
Out[73]: array([0, 0], dtype=uint64)
In [74]: f[list(f.keys())[0]]['b']
Out[74]: <HDF5 group "/#refs#/b" (7 members)>
In [75]: f[list(f.keys())[0]]['b'].keys()
Out[75]: <KeysViewHDF5 ['annoBegin', 'fps', 'fps_no_ftr', 'len', 'res', 'startFrame', 'type']>
In [76]: f[list(f.keys())[0]]['b']['fps']
Out[76]: <HDF5 dataset "fps": shape (1, 1), type "<f8">
In [77]: f[list(f.keys())[0]]['b']['fps'][:]
Out[77]: array([[22.36617883]])
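The key-by-key exploration above can also be automated with visititems, which calls a function for every group and dataset in the file. This is a minimal sketch: summarize is a made-up helper name, and the in-memory file is a stand-in with the same layout as the example file, not the file itself.

```python
import h5py
import numpy as np


def summarize(h5file):
    """Return one descriptive line per group or dataset in the file."""
    lines = []

    def visitor(name, obj):
        if isinstance(obj, h5py.Dataset):
            lines.append(f"{name}: dataset, shape={obj.shape}, dtype={obj.dtype}")
        else:
            lines.append(f"{name}: group, {len(obj)} members")

    h5file.visititems(visitor)
    return lines


# In-memory stand-in for the real file (made-up contents)
with h5py.File("stand_in.mat", "w", driver="core", backing_store=False) as f:
    struct = f.create_group("#refs#/b")
    struct["fps"] = np.array([[22.36617883]])
    struct["res"] = np.zeros((4, 725))
    overview = summarize(f)

for line in overview:
    print(line)
```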

In the OS shell, I can look at the file with h5dump. From that it looks like the res dataset has the most data. The datasets also have attributes. h5dump may be a better way of getting an overview, which can then guide the h5py loads.

In [80]: f[list(f.keys())[0]]['b']['res'][:]
Out[80]: 
array([[198., 196., 195., ..., 330., 328., 326.],
       [214., 214., 216., ..., 197., 196., 192.],
       [ 34.,  34.,  34., ...,  34.,  34.,  34.],
       [ 81.,  81.,  81., ...,  81.,  80.,  80.]])
In [81]: f[list(f.keys())[0]]['b']['res'][:].shape
Out[81]: (4, 725)
In [82]: f[list(f.keys())[0]]['b']['res'][:].dtype
Out[82]: dtype('<f8')
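The (4, 725) shape is the reverse of the 725x4 array MATLAB shows, because MATLAB stores arrays column-major and h5py reports the axes in the opposite order. Transposing after reading restores MATLAB's orientation. The sketch below uses only the first three columns printed above as sample values.

```python
import numpy as np

# First three columns of the res dataset, as h5py returns them (shape (4, 3))
as_h5py = np.array([
    [198., 196., 195.],
    [214., 214., 216.],
    [ 34.,  34.,  34.],
    [ 81.,  81.,  81.],
])

# Reversing the axes gives one row per MATLAB row
as_matlab = as_h5py.T

print(as_matlab.shape)
print(as_matlab[0])
```

With the real file, that would be the array read from the res dataset followed by .T.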

Upvotes: 2
