Daniel Crane

Reputation: 257

h5py: Store list of list of strings

Given a list of list of strings, such as:

test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]

I'd like to store it using h5py such that:

f['test_dataset'][0] = ['a1','a2']
f['test_dataset'][0][0] = 'a1'
etc.

Following the advice in the thread H5py store list of list of strings, I tried the following:

import h5py
test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
with h5py.File('test.h5','w') as f:
    string_dt = h5py.special_dtype(vlen=str)
    f.create_dataset('test_dataset',data=test_array,dtype=string_dt)

However this results in each of the nested lists being stored as strings, i.e.:

f['test_dataset'][0] = "['a1', 'a2']"
f['test_dataset'][0][0] = '['

If this isn't possible using h5py, or any other hdf5-based library, I'd be happy to hear any suggestions of other possible formats/libraries that I could use to store my data.

My data consists of multidimensional numpy integer arrays and nested lists of strings as in the example above, with more than 100M rows and about 8 columns.

Thanks!

Upvotes: 3

Views: 3841

Answers (2)

周志华

Reputation: 1

An ugly workaround is to store the repr of the whole list as a single string:

hf.create_dataset('test', data=repr(test_array))

Upvotes: 0

hpaulj

Reputation: 231385

In Saving with h5py arrays of different sizes I suggest saving a list of variable-length arrays as multiple datasets, one per sublist:

In [19]: f = h5py.File('test.h5','w')
In [20]: g = f.create_group('test_array')
In [21]: test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
In [22]: string_dt = h5py.special_dtype(vlen=str)
In [23]: for i,v in enumerate(test_array):
    ...:     g.create_dataset(str(i), data=np.array(v,'S4'), dtype=string_dt)
    ...:     
In [24]: for k in g.keys():
    ...:     print(k,g[k][:])
    ...:     
0 ['a1' 'a2']
1 ['b1']
2 ['c1' 'c2' 'c3' 'c4']

For many small sublists this could be messy, though I'm not sure it's inefficient.

'Flattening' with a list join might work:

In [27]: list1 =[', '.join(x) for x in test_array]
In [28]: list1
Out[28]: ['a1, a2', 'b1', 'c1, c2, c3, c4']
In [30]: '\n'.join(list1)
Out[30]: 'a1, a2\nb1\nc1, c2, c3, c4'

The nested list can be recreated with a few split calls.
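As a rough sketch of that round trip (the helper names `flatten` and `unflatten` are made up for illustration, and it assumes neither separator ever occurs inside one of the strings):

```python
# Hypothetical helpers; the separators are an assumption and must not
# appear inside any of the stored strings.
ROW_SEP = "\n"
ITEM_SEP = ", "

def flatten(nested):
    """Join each sublist with ITEM_SEP, then join the rows with ROW_SEP."""
    return ROW_SEP.join(ITEM_SEP.join(row) for row in nested)

def unflatten(text):
    """Reverse flatten(): split into rows first, then into items."""
    return [row.split(ITEM_SEP) for row in text.split(ROW_SEP)]

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
flat = flatten(test_array)
restored = unflatten(flat)
```

One caveat: an empty sublist comes back as `['']` rather than `[]`, so this only round-trips cleanly when every sublist has at least one item.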

Another thought - pickle to a string and save that.


From the h5py intro

An HDF5 file is a container for two kinds of objects: datasets, which
are array-like collections of data, and groups, which are folder-like
containers that hold datasets and other groups. The most fundamental
thing to remember when using h5py is:

Groups work like dictionaries, and datasets work like NumPy arrays

pickle doesn't work, because the pickled bytes contain NULLs:

In [32]: import pickle
In [33]: pickle.dumps(test_array)
Out[33]: b'\x80\x03]q\x00(]q\x01(X\x02\x00\x00\x00a1q\x02X\x02\x00\x00\x00a2q\x03e]q\x04X\x02\x00\x00\x00b1q\x05a]q\x06(X\x02\x00\x00\x00c1q\x07X\x02\x00\x00\x00c2q\x08X\x02\x00\x00\x00c3q\tX\x02\x00\x00\x00c4q\nee.'
In [34]: f.create_dataset('pickled', data=pickle.dumps(test_array), dtype=string_dt)
....
ValueError: VLEN strings do not support embedded NULLs
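The error is consistent with the `\x00` bytes visible in the pickle output above. One possible way around it, which is not part of the session above and is only a sketch, is to base64-encode the pickle so that it becomes plain ASCII text. `protocol=3` is chosen here only to match the `\x80\x03` header shown in the output above:

```python
import base64
import pickle

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# Protocol 3 matches the output shown above; it stores each string
# length as 4 bytes, so short strings produce embedded NUL bytes.
payload = pickle.dumps(test_array, protocol=3)
has_null = b"\x00" in payload

# Base64 maps arbitrary bytes onto an ASCII alphabet without NUL,
# so the encoded text contains no embedded NULs.
encoded = base64.b64encode(payload).decode("ascii")

# Decoding reverses both steps.
restored = pickle.loads(base64.b64decode(encoded))
```

The encoded text is about a third larger than the raw pickle, and unpickling data from an untrusted file is unsafe, so the json route may be the better choice for plain lists of strings.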

json

In [35]: import json
In [36]: json.dumps(test_array)
Out[36]: '[["a1", "a2"], ["b1"], ["c1", "c2", "c3", "c4"]]'
In [37]: f.create_dataset('pickled', data=json.dumps(test_array), dtype=string_dt)
Out[37]: <HDF5 dataset "pickled": shape (), type "|O">
In [43]: json.loads(f['pickled'][()])
Out[43]: [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
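The dataset above holds the whole structure as one scalar string, so reading any single row means decoding everything. A possible variation, not from the session above and not benchmarked at the 100M-row scale in the question, is to encode one JSON string per row. The result is a flat list of strings, which could then be written as a 1-D variable-length string dataset. The sketch below shows only the encoding and decoding, without h5py:

```python
import json

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# One JSON document per sublist gives a flat list of strings,
# with row i of the original data at index i.
rows = [json.dumps(row) for row in test_array]

# Decoding a single entry recovers that sublist on its own.
first_row = json.loads(rows[0])
first_item = first_row[0]
```

With this layout, getting at `'a1'` takes one `json.loads` call on the first entry rather than a decode of the full structure.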

Upvotes: 1
