'/' in names in HDF5 files confusion

I am experiencing some really weird interactions between h5py, PyTables (via Pandas), and C++ generated HDF5 files. It seems that h5check and h5py cope with type names containing '/', but pandas/PyTables cannot. Clearly, there is a gap in my understanding, so:

What have I not understood here?


The gory details

I have the following data in an HDF5 file:

   [...]
   DATASET "log" {
      DATATYPE  H5T_COMPOUND {
         H5T_COMPOUND {
            H5T_STD_U32LE "sec";
            H5T_STD_U32LE "usec";
         } "time";
         H5T_IEEE_F32LE "CIF/align/aft_port_end/extend_pressure";
         [...]

This was created via the C++ API. The h5check utility says the file is valid.

Note that CIF/align/aft_port_end/extend_pressure is not meant as a path to a group/node/leaf. It is a label that we use internally, which happens to have some internal structure with '/' as the delimiter. We do not want the HDF5 file to know anything about that: it should not care. Clearly, if '/' is illegal in any HDF5 name, then we have to change that delimiter to something else.

Using PyTables (okay, Pandas, but it uses PyTables internally) to read the file, I get a warning:

 >>> import pandas as pd
 >>> store = pd.HDFStore('data/XXX-20150423-071618.h5')
 >>> store
/home/XXX/virt/env/develop/lib/python2.7/site-packages/tables/group.py:1156: UserWarning: problems loading leaf ``/log``::

  the ``/`` character is not allowed in object names: 'XXX/align/aft_port_end/extend_pressure'

The leaf will become an ``UnImplemented`` node. 

I asked about this in this question and was told that '/' is illegal according to the specification. However, things get stranger with h5py...

Using h5py to read the file, I get what I want:

>>> import h5py
>>> f = h5py.File('data/XXX-20150423-071618.h5', 'r')
>>> f['/log'].dtype
dtype([('time', [('sec', '<u4'), ('usec', '<u4')]), ('CIF/align/aft_port_end/extend_pressure', '<f4')[...]

Which is more or less what I set out with.

Needless to say, I am confused. Have I managed to create an illegal HDF5 file that somehow passes h5check? Or is PyTables simply not supporting this edge case?


Clearly, I could write a simple wrapper, something like this:

>>> import matplotlib.pyplot as plt
>>> silly = pd.DataFrame(f['/log']['CIF/align/aft_port_end/extend_pressure'])
>>> silly.plot()
>>> plt.show()

to get all the data from the HDF5 file into Pandas. However, I am not sure this is a good idea, given the confusion above. My biggest worry is that the conversion might not scale if the data is very large.
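One way to keep memory bounded is to read the dataset a block of rows at a time and pull out only the field of interest. The following is a minimal sketch, not a tested solution for this file: the helper name `iter_field_chunks` and the default chunk size are made up for illustration, and it assumes only that the dataset has a `shape` and supports row slicing, as an h5py dataset does.

```python
def iter_field_chunks(dataset, field, chunk_rows=100000):
    """Yield one compound field from `dataset`, `chunk_rows` rows at a time.

    `dataset` is anything with a `shape` that supports row slicing
    (for example an h5py dataset). Each slice is indexed by field name
    after it has been read. The helper name is hypothetical.
    """
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    n_rows = dataset.shape[0]
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        # Only rows [start, stop) are read here; the field is picked afterwards.
        yield dataset[start:stop][field]


# Hypothetical usage with the file from the question (needs h5py and pandas):
#
#   import h5py
#   import pandas as pd
#   name = 'CIF/align/aft_port_end/extend_pressure'
#   with h5py.File('data/XXX-20150423-071618.h5', 'r') as f:
#       parts = [pd.Series(c) for c in iter_field_chunks(f['/log'], name)]
#   pressure = pd.concat(parts, ignore_index=True)
```

Each step reads every field for its block of rows and then discards all but one, so the peak memory is roughly one block plus the single column being collected.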

Upvotes: 19

Views: 3391

Answers (3)

tmthydvnprt

Reputation: 10758

Could you use h5py to read through all your files and rewrite them without the offending characters, so that PyTables can read them?

If it is outside the spec, I assume what you are experiencing is just that some implementations handle it and others do not...
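As a rough sketch of the renaming step only: the function below takes a dtype description in the list-of-tuples form that h5py/NumPy print (as in the question) and returns a copy with the slashes replaced, including inside nested compound members. The function name `sanitize_descr` and the choice of '|' as the replacement are assumptions for illustration; reading and rewriting the actual datasets is not shown.

```python
def sanitize_descr(descr, bad="/", good="|"):
    """Return a copy of a dtype description with `bad` replaced in field names.

    `descr` is a list of (name, type) or (name, type, shape) tuples, where
    `type` may itself be a nested list for compound members.
    The function name and the default replacement are hypothetical.
    """
    clean = []
    for entry in descr:
        name, kind = entry[0], entry[1]
        if good in name:
            # Refuse names that would become ambiguous after the replacement.
            raise ValueError("field name already contains %r: %r" % (good, name))
        if isinstance(kind, list):
            kind = sanitize_descr(kind, bad, good)  # recurse into nested compounds
        clean.append((name.replace(bad, good), kind) + tuple(entry[2:]))
    return clean


descr = [('time', [('sec', '<u4'), ('usec', '<u4')]),
         ('CIF/align/aft_port_end/extend_pressure', '<f4')]
new_descr = sanitize_descr(descr)
```

The returned list is in the form that `numpy.dtype(...)` accepts, so it could serve as the dtype for the rewritten dataset.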

Upvotes: 2

titusjan

Reputation: 5546

I've browsed a bit through the h5check source and I can't find any place where it tests if a name contains a slash. You can examine the error messages it can produce with:

grep error_push h5checker.c -A1

The links you provided clearly state that slashes are not allowed in object names. So yes, I think you've made a file that is illegal but passes h5check. The tool seems to focus more on the binary data layout. The closest related check I can find is a guard against duplicate names.

In my opinion that's all there is to it. The fact that h5py and other libraries somehow are able to create or read this illegal file is irrelevant. The spec says "don't put slashes in object names", so you don't. End of story.

If you're not convinced, think of it like this: if you somehow managed to create a regular file with a slash in its file name, what would happen? Most programs assume that file names contain no slashes and thus that they are able to partition a directory path by splitting it at the slash characters. Your file would break this behavior and so introduce many subtle (and not so subtle) bugs. Users would complain, programmers would hate you, system administrators would curse you.

Likewise, it's safe to assume that, besides PyTables, many other libraries and programs will not be able to handle slashes in variable names. The nice thing about HDF is that so many tools exist for it, and by using slashes you throw away that advantage. You may think that this is not important, since perhaps your HDF5 files are for internal use only. However, the situation may change in 5 years, as situations tend to do.

Just bite the bullet and replace '/' with '|' before writing your variables to HDF5. Replace them back when you read them. The time you lose by implementing this, you'll win back x-fold (for x>1) by avoiding future bugs and user complaints.
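A minimal sketch of that round trip, shown in Python for brevity (the same two string replacements apply on the C++ writer side). The function and constant names are made up for illustration, and the guard assumes '|' never occurs in your labels; otherwise the reverse mapping would not be unique.

```python
LABEL_SEP = "/"  # delimiter used in the internal labels
FILE_SEP = "|"   # stand-in that is actually written to the HDF5 file


def to_hdf5_name(label):
    """Map an internal label to a name without slashes (hypothetical helper)."""
    if FILE_SEP in label:
        # Reject labels that could not be mapped back unambiguously.
        raise ValueError("label already contains %r: %r" % (FILE_SEP, label))
    return label.replace(LABEL_SEP, FILE_SEP)


def from_hdf5_name(name):
    """Map a stored name back to the internal label (hypothetical helper)."""
    return name.replace(FILE_SEP, LABEL_SEP)


label = "CIF/align/aft_port_end/extend_pressure"
stored = to_hdf5_name(label)
restored = from_hdf5_name(stored)
```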

Sorry about the rant but I hope to have convinced you.

Upvotes: 7

Jason Newton

Reputation: 1211

Make sure you are creating groups rather than just passing the path name outright; this is probably where the fault creeps in. If you create the groups leading to your objects and then name the objects with the leaf names (extend_pressure in the example above), you won't have any problems anywhere.

h5py is a pretty thin wrapper around the C HDF5 library. pandas/PyTables are a lot more heavyweight in approach, or at least they have a lot more of their own semantics going on, and so they check that you don't have '/' in your object names. But keep in mind that everybody is using the HDF5 library at the end of the day: while HDF5 is great, it would be a huge effort to make an alternative implementation, beyond the resources of pandas/PyTables.

Minor disclaimer: I've hacked on internals of HDF5 and H5py before.

Upvotes: 0
