How to return a list of strings stored in an HDF5 file

Question

Hope someone can shed some light on this. I am trying to learn my way around HDF5 files. Somehow this list of strings gets encoded into the file as an array of integers, but I can't figure out how to decode it. I can load the file back into pandas using the read_hdf function, but that's not the point: I am trying to understand the encoding logic. Here is the example I was working with.

smiles.txt = 

structure
[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24
[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24
[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F
[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3
[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3

>>> import pandas as pd
>>> df = pd.read_csv('smiles.txt', header=0)

>>> df.to_hdf('smiles.h5', 'table')

I then explore the structure of the newly created HDF5 file:

>>> import h5py
>>> with h5py.File('smiles.h5', 'r') as f:
...     f.visit(print)

table
table/axis0
table/axis1
table/block0_items
table/block0_values

>>> with h5py.File('smiles_temp', 'r') as f:
...     print(list(f.keys()))
...     print(f['/thekey/axis0'][:])
...     print(f['/thekey/axis1'][:])
...     print(f['/thekey/block0_items'][:])
...     print(f['/thekey/block0_values'][:])

['thekey']
[b'structure']
[0 1 2 3 4]
[b'structure']
[array([128,   4, 149, 123,   1,   0,   0,   0,   0,   0,   0, 140,  21,
       110, 117, 109, 112, 121,  46,  99, 111, 114, 101,  46, 109, 117,
       108, 116, 105,  97, 114, 114,  97, 121, 148, 140,  12,  95, 114,
       101,  99, 111, 110, 115, 116, 114, 117,  99, 116, 148, 147, 148,
       140,   5, 110, 117, 109, 112, 121, 148, 140,   7, 110, 100,  97,
       114, 114,  97, 121, 148, 147, 148,  75,   0, 133, 148,  67,   1,
        98, 148, 135, 148,  82, 148,  40,  75,   1,  75,   5,  75,   1,
       134, 148, 104,   3, 140,   5, 100, 116, 121, 112, 101, 148, 147,
       148, 140,   2,  79,  56, 148,  75,   0,  75,   1, 135, 148,  82,
       148,  40,  75,   3, 140,   1, 124, 148,  78,  78,  78,  74, 255,
       255, 255, 255,  74, 255, 255, 255, 255,  75,  63, 116, 148,  98,
       137,  93, 148,  40, 140,  41,  91,  49,  49,  67,  72,  50,  93,
        49,  78,  67,  67,  78,  50,  67,  91,  67,  64,  64,  72,  93,
        51,  67,  67,  67,  91,  67,  64,  64,  72,  93,  51,  99,  52,
        99,  99,  99,  99,  49,  99,  50,  52, 148, 140,  40,  91,  49,
        49,  67,  72,  50,  93,  49,  78,  67,  67,  78,  50,  91,  67,
        64,  64,  72,  93,  51,  67,  67,  67,  91,  67,  64,  64,  72,
        93,  51,  99,  52,  99,  99,  99,  99,  49,  99,  50,  52, 148,
       140,  54,  91,  49,  49,  67,  72,  51,  93,  99,  49,  99,  99,
        99,  40,  99,  99,  49,  41,  99,  50,  99,  99,  40, 110, 110,
        50,  99,  51,  99,  99,  99,  40,  99,  99,  51,  41,  83,  40,
        61,  79,  41,  40,  61,  79,  41,  78,  41,  67,  40,  70,  41,
        40,  70,  41,  70, 148, 140,  44,  91,  49,  49,  67,  72,  51,
        93,  99,  49,  99,  99,  99,  99,  99,  49,  79,  91,  67,  64,
        72,  93,  40,  91,  67,  64,  64,  72,  93,  50,  67,  78,  67,
        67,  79,  50,  41,  99,  51,  99,  99,  99,  99,  99,  51, 148,
       140,  44,  91,  49,  49,  67,  72,  51,  93,  99,  49,  99,  99,
        99,  99,  99,  49,  83,  91,  67,  64,  72,  93,  40,  91,  67,
        64,  64,  72,  93,  50,  67,  78,  67,  67,  79,  50,  41,  99,
        51,  99,  99,  99,  99,  99,  51, 148, 101, 116, 148,  98,  46],
      dtype=uint8)]

How does one go about returning the list of strings using h5py?

hpaulj · Accepted Answer

Just to clarify, the dataframe displays as:

In [2]: df = pd.read_csv('stack63452223.csv', header=0)                                              
In [3]: df                                                                                           
Out[3]: 
                                           structure
0          [11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24
1           [11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24
2  [11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)...
3       [11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3
4       [11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3

In [11]: df._values                                                                                  
Out[11]: 
array([['[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24'],
       ['[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24'],
       ['[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F'],
       ['[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3'],
       ['[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3']], dtype=object)

or as a list of strings:

In [24]: df['structure'].to_list()                                                                   
Out[24]: 
['[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24',
 '[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24',
 '[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F',
 '[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3',
 '[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3']

The h5 file is written by pytables (which pandas uses for to_hdf), and that is a different library from h5py. Generally h5py can read pytables files, but the details can be complicated.

The keys under the 'table' group:

['axis0', 'axis1', 'block0_items', 'block0_values']

A dataframe has axes (row and column). On another occasion I looked at how a dataframe stores its values, and found that it uses blocks, each holding columns with a common dtype. Here you have 1 column, and it is object dtype, since it contains strings.

Strings are a bit awkward in HDF5, especially unicode. numpy arrays use a unicode string dtype; pandas uses object dtype, referencing Python strings (stored outside the dataframe). I suspect then that in saving such a frame pytables is using a more complex referencing scheme (that isn't immediately obvious via h5py).

Guess that's a long answer to just say I don't know.

Pandas own h5 load:

In [19]: pd.read_hdf('stack63452223.h5', 'table')                                                    
Out[19]: 
                                           structure
0          [11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24
1           [11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24
2  [11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)...
3       [11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3
4       [11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3

The h5 objects also have attrs,

In [38]: f['table'].attrs.keys()                                                                     
Out[38]:

Fiddling around I found that:

In [66]: x=f['table']['block0_values'][0]                                                            
In [67]: b''.join(x.view('S1').tolist())                                                             
Out[67]: b'\x80\x04\x95y\x01\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x05K\x01\x86\x94h\x03\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94(\x8c)[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c([11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c6[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F\x94\x8c,[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3\x94\x8c,[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3\x94et\x94b.'

Looks like your strings are there. uint8 is a single-byte dtype, so the array can be viewed as bytes. Joining them, I see your strings, concatenated in some fashion.
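
For what it's worth, the leading \x80\x04\x95 is how a pickle protocol 4 stream starts, and \x8c followed by a length byte is pickle's short-string opcode, so the block looks like a pickled Python object rather than anything HDF5-specific. Below is a minimal sketch that lists the opcodes with the standard library's pickletools, without unpickling anything. The payload is a stand-in I build from two of your strings (an assumption for illustration), not the real dataset:

```python
import pickle
import pickletools

# Stand-in payload (assumption): a protocol-4 pickle of two of the strings,
# used here instead of the bytes read from the HDF5 file.
strings = [
    "[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24",
    "[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3",
]
payload = pickle.dumps(strings, protocol=4)

# genops walks the opcode stream without executing it.
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(payload)]

# Short unicode strings are stored as SHORT_BINUNICODE opcodes.
embedded = [arg for name, arg in ops if name == "SHORT_BINUNICODE"]

print(payload[:2])          # b'\x80\x04'
print(ops[0])               # ('PROTO', 4)
print(embedded == strings)  # True
```

Run on the real bytes (the joined result above), the same loop should also pick up the short strings that precede yours in the stream, such as 'numpy.core.multiarray' and '_reconstruct', since they are stored with the same opcode.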

Reformatting:

Out[67]: b'\x80\x04\x95y\x01\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x05K\x01\x86\x94h\x03\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94(\x8c)
[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c(
[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c6
[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F\x94\x8c,
[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3\x94\x8c,
[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3\x94et\x94b.'
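
The \x80\x04 prefix and the module names at the start (numpy.core.multiarray, _reconstruct) suggest this is a pickled numpy object array, so pickle should be able to rebuild it from the raw bytes. A minimal sketch, assuming that reading is right; decode_pickled_bytes is a name I made up, and the demo uses a stand-in list of integers rather than your file:

```python
import pickle


def decode_pickled_bytes(values):
    """Rebuild the pickled object from a sequence of uint8 values.

    bytes() accepts a list of ints as well as a 1-D uint8 numpy array.
    Only unpickle data from files you trust: pickle can run arbitrary code.
    """
    return pickle.loads(bytes(values))


# Stand-in for f['table']['block0_values'][0] (assumption): pickle a list of
# strings and expose the payload as plain integers, the way h5py shows it.
strings = [
    "[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24",
    "[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3",
]
as_ints = list(pickle.dumps(strings, protocol=4))

print(as_ints[:2])                               # [128, 4]
print(decode_pickled_bytes(as_ints) == strings)  # True
```

Against the real file, decode_pickled_bytes(f['table']['block0_values'][0]) should hand back a numpy object array (numpy has to be importable for the unpickling to work), and .ravel().tolist() on that would give the plain list of strings. I have not checked how this behaves for frames with mixed dtypes or for files written with format='table'.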
