Reputation: 1087
Hope someone can shed some light on this. I am trying to learn my way around HDF5 files. Somehow this list of strings gets encoded into the file as an array of integers, but I'm not able to figure out how to decode it. I can load the file back into pandas using the read_hdf function, but that's not the point: I am trying to understand the encoding logic. Here is a summary of the example I was working with.
smiles.txt =
structure
[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24
[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24
[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F
[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3
[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3
>>> import pandas as pd
>>> df = pd.read_csv('smiles.txt', header=0)
>>> df.to_hdf('smiles.h5', 'table')
I then explore the structure of the newly created HDF5 file:
>>> import h5py
>>> with h5py.File('smiles.h5', 'r') as f:
...     f.visit(print)
table
table/axis0
table/axis1
table/block0_items
table/block0_values
>>> with h5py.File('smiles_temp', 'r') as f:
...     print(list(f.keys()))
...     print(f['/thekey/axis0'][:])
...     print(f['/thekey/axis1'][:])
...     print(f['/thekey/block0_items'][:])
...     print(f['/thekey/block0_values'][:])
['thekey']
[b'structure']
[0 1 2 3 4]
[b'structure']
[array([128, 4, 149, 123, 1, 0, 0, 0, 0, 0, 0, 140, 21,
110, 117, 109, 112, 121, 46, 99, 111, 114, 101, 46, 109, 117,
108, 116, 105, 97, 114, 114, 97, 121, 148, 140, 12, 95, 114,
101, 99, 111, 110, 115, 116, 114, 117, 99, 116, 148, 147, 148,
140, 5, 110, 117, 109, 112, 121, 148, 140, 7, 110, 100, 97,
114, 114, 97, 121, 148, 147, 148, 75, 0, 133, 148, 67, 1,
98, 148, 135, 148, 82, 148, 40, 75, 1, 75, 5, 75, 1,
134, 148, 104, 3, 140, 5, 100, 116, 121, 112, 101, 148, 147,
148, 140, 2, 79, 56, 148, 75, 0, 75, 1, 135, 148, 82,
148, 40, 75, 3, 140, 1, 124, 148, 78, 78, 78, 74, 255,
255, 255, 255, 74, 255, 255, 255, 255, 75, 63, 116, 148, 98,
137, 93, 148, 40, 140, 41, 91, 49, 49, 67, 72, 50, 93,
49, 78, 67, 67, 78, 50, 67, 91, 67, 64, 64, 72, 93,
51, 67, 67, 67, 91, 67, 64, 64, 72, 93, 51, 99, 52,
99, 99, 99, 99, 49, 99, 50, 52, 148, 140, 40, 91, 49,
49, 67, 72, 50, 93, 49, 78, 67, 67, 78, 50, 91, 67,
64, 64, 72, 93, 51, 67, 67, 67, 91, 67, 64, 64, 72,
93, 51, 99, 52, 99, 99, 99, 99, 49, 99, 50, 52, 148,
140, 54, 91, 49, 49, 67, 72, 51, 93, 99, 49, 99, 99,
99, 40, 99, 99, 49, 41, 99, 50, 99, 99, 40, 110, 110,
50, 99, 51, 99, 99, 99, 40, 99, 99, 51, 41, 83, 40,
61, 79, 41, 40, 61, 79, 41, 78, 41, 67, 40, 70, 41,
40, 70, 41, 70, 148, 140, 44, 91, 49, 49, 67, 72, 51,
93, 99, 49, 99, 99, 99, 99, 99, 49, 79, 91, 67, 64,
72, 93, 40, 91, 67, 64, 64, 72, 93, 50, 67, 78, 67,
67, 79, 50, 41, 99, 51, 99, 99, 99, 99, 99, 51, 148,
140, 44, 91, 49, 49, 67, 72, 51, 93, 99, 49, 99, 99,
99, 99, 99, 49, 83, 91, 67, 64, 72, 93, 40, 91, 67,
64, 64, 72, 93, 50, 67, 78, 67, 67, 79, 50, 41, 99,
51, 99, 99, 99, 99, 99, 51, 148, 101, 116, 148, 98, 46],
dtype=uint8)]
How does one go about returning the list of strings using h5py?
Upvotes: 0
Views: 288
Reputation: 231665
Just to clarify, the dataframe displays as:
In [2]: df = pd.read_csv('stack63452223.csv', header=0)
In [3]: df
Out[3]:
structure
0 [11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24
1 [11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24
2 [11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)...
3 [11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3
4 [11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3
In [11]: df._values
Out[11]:
array([['[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24'],
['[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24'],
['[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F'],
['[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3'],
['[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3']], dtype=object)
or as a list of strings:
In [24]: df['structure'].to_list()
Out[24]:
['[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24',
'[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24',
'[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F',
'[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3',
'[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3']
The h5 is written by pytables, which is different from h5py; generally h5py can read pytables, but the details can be complicated.
The top level keys:
['axis0', 'axis1', 'block0_items', 'block0_values']
A dataframe has axes (row and column). On another occasion I looked at how a dataframe stores its values, and found that it uses blocks, each holding columns with a common dtype. Here you have 1 column, and it is object dtype, since it contains strings.
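As a rough picture of what "blocks" means here, this toy sketch groups column names by value type. It is only an illustration in plain Python, not pandas' actual implementation; the `columns` dict, the extra `count` and `name` columns, and the `group_into_blocks` helper are all made up for the example.

```python
from collections import defaultdict

# Hypothetical frame: column name -> list of values.
columns = {
    "structure": [
        "[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24",
        "[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24",
    ],
    "count": [1, 2],
    "name": ["a", "b"],
}


def group_into_blocks(cols):
    """Group column names by the Python type of their first value."""
    blocks = defaultdict(list)
    for name, values in cols.items():
        blocks[type(values[0]).__name__].append(name)
    return dict(blocks)


blocks = group_into_blocks(columns)
print(blocks)  # {'str': ['structure', 'name'], 'int': ['count']}
```

In the file from the question there is a single string column, so there is only one such group, which matches the single block0_items / block0_values pair.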
Strings are a bit awkward in HDF5, especially unicode. numpy arrays use a unicode string dtype; pandas uses object dtype, referencing Python strings (stored outside the dataframe). I suspect then that in saving such a frame pytables is using a more complex referencing scheme (that isn't immediately obvious via h5py).
Guess that's a long answer to just say I don't know.
Pandas' own h5 load:
In [19]: pd.read_hdf('stack63452223.h5', 'table')
Out[19]:
structure
0 [11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24
1 [11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24
2 [11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)...
3 [11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3
4 [11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3
The h5 objects also have attrs:
In [38]: f['table'].attrs.keys()
Out[38]: <KeysViewHDF5 ['CLASS', 'TITLE', 'VERSION', 'axis0_variety', 'axis1_variety', 'block0_items_variety', 'encoding', 'errors', 'nblocks', 'ndim', 'pandas_type', 'pandas_version']>
Fiddling around I found that:
In [66]: x=f['table']['block0_values'][0]
In [67]: b''.join(x.view('S1').tolist())
Out[67]: b'\x80\x04\x95y\x01\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x05K\x01\x86\x94h\x03\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94(\x8c)[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c([11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c6[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F\x94\x8c,[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3\x94\x8c,[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3\x94et\x94b.'
Looks like your strings are there. uint8 is a single-byte dtype, which can be viewed as bytes. Joining them, I see your strings, concatenated in some fashion.
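To make the "view as bytes" step concrete, here is a minimal sketch using only the standard library. The integers are copied from the block0_values dump in the question; the variable names are just for illustration. The leading 128, 4 is the header that Python's pickle module writes for protocol 4, and 140 (0x8c) is a pickle opcode that is followed by a one-byte length and then the string itself.

```python
import pickletools

# First two values of the dump: the pickle protocol 4 header.
header = bytes([128, 4])
print(header)  # b'\x80\x04'

# A short run taken from further into the same dump.
chunk = [140, 41, 91, 49, 49, 67, 72, 50, 93]
opcode, length, text = chunk[0], chunk[1], bytes(chunk[2:])

print(hex(opcode))                             # 0x8c
print(pickletools.code2op[chr(opcode)].name)   # SHORT_BINUNICODE
print(length)                                  # 41, the length of the first string
print(text.decode("ascii"))                    # [11CH2]
```

With an h5py row x of dtype uint8, x.tobytes() should give the same kind of byte string that bytes(...) produces here.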
Reformatting:
Out[67]: b'\x80\x04\x95y\x01\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x05K\x01\x86\x94h\x03\x8c\x05dtype\x94\x93\x94\x8c\x02O8\x94\x89\x88\x87\x94R\x94(K\x03\x8c\x01|\x94NNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK?t\x94b\x89]\x94(\x8c)
[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c(
[11CH2]1NCCN2[C@@H]3CCC[C@@H]3c4cccc1c24\x94\x8c6
[11CH3]c1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F\x94\x8c,
[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3\x94\x8c,
[11CH3]c1ccccc1S[C@H]([C@@H]2CNCCO2)c3ccccc3\x94et\x94b.'
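The \x80\x04 header, together with the numpy.core.multiarray and _reconstruct names, suggests the row is a pickled numpy object array. Assuming that is right, pickle.loads(x.tobytes()) on the uint8 row should hand back the array of strings (numpy has to be importable, and you should only unpickle files you trust). The sketch below shows the same round trip with the standard library only, with a plain list standing in for the array; encode_row and decode_row are hypothetical helper names, not part of h5py or pytables.

```python
import pickle

smiles = [
    "[11CH2]1NCCN2C[C@@H]3CCC[C@@H]3c4cccc1c24",
    "[11CH3]c1ccccc1O[C@H]([C@@H]2CNCCO2)c3ccccc3",
]


def encode_row(obj):
    """Pickle obj and expose the result as a list of ints in the uint8 range."""
    return list(pickle.dumps(obj, protocol=4))


def decode_row(row):
    """Reverse encode_row: rebuild the byte string and unpickle it."""
    return pickle.loads(bytes(row))


row = encode_row(smiles)
print(row[:2])                     # [128, 4]
print(decode_row(row) == smiles)   # True

# With h5py, assuming the layout shown above (not run here):
#     x = f['table']['block0_values'][0]
#     strings = pickle.loads(x.tobytes()).ravel().tolist()
```

In the reformatted dump, each string is preceded by \x8c and a one-byte length (")" is 41, "(" is 40, "6" is 54, "," is 44), which is how pickle frames short strings.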
Upvotes: 1