Reputation: 257
Given a list of lists of strings, such as:
test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
I'd like to store it using h5py such that:
f['test_dataset'][0] = ['a1','a2']
f['test_dataset'][0][0] = 'a1'
etc.
Following the advice in the thread H5py store list of list of strings, I tried the following:
import h5py
test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
with h5py.File('test.h5','w') as f:
    string_dt = h5py.special_dtype(vlen=str)
    f.create_dataset('test_dataset',data=test_array,dtype=string_dt)
However, this results in each of the nested lists being stored as a single string, i.e.:
f['test_dataset'][0] = "['a1', 'a2']"
f['test_dataset'][0][0] = '['
If this isn't possible using h5py, or any other hdf5-based library, I'd be happy to hear any suggestions of other possible formats/libraries that I could use to store my data.
My data consists of multidimensional numpy integer arrays and nested lists of strings as in the example above, with more than 100M rows and ~8 columns.
Thanks!
Upvotes: 3
Views: 3841
Reputation: 231385
In "Saving with h5py arrays of different sizes" I suggest saving a list of variable-length arrays as multiple datasets:
In [18]: import h5py, numpy as np
In [19]: f = h5py.File('test.h5','w')
In [20]: g = f.create_group('test_array')
In [21]: test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
In [22]: string_dt = h5py.special_dtype(vlen=str)
In [23]: for i,v in enumerate(test_array):
    ...:     g.create_dataset(str(i), data=np.array(v,'S4'), dtype=string_dt)
    ...:
In [24]: for k in g.keys():
    ...:     print(k,g[k][:])
    ...:
0 ['a1' 'a2']
1 ['b1']
2 ['c1' 'c2' 'c3' 'c4']
For many small sublists this could be messy, though I'm not sure it's inefficient.
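Reading the group back into a nested list takes one extra step, sketched below. The file name `per_row.h5` and the helper `to_text` are made up for the illustration; the sketch assumes the keys are the integer strings written above, and that variable-length strings may come back as `bytes` (h5py 3.x) or `str` (h5py 2.x).

```python
import os
import tempfile

import h5py

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
string_dt = h5py.special_dtype(vlen=str)
path = os.path.join(tempfile.mkdtemp(), 'per_row.h5')  # throwaway file for the demo

# Write one small dataset per sublist, named by its row index.
with h5py.File(path, 'w') as f:
    g = f.create_group('test_array')
    for i, sublist in enumerate(test_array):
        g.create_dataset(str(i), data=sublist, dtype=string_dt)


def to_text(item):
    # Hypothetical helper: decode bytes if needed, pass str through unchanged.
    return item.decode('utf-8') if isinstance(item, bytes) else item


with h5py.File(path, 'r') as f:
    g = f['test_array']
    # Keys are strings, so '10' would sort before '2'; sort numerically
    # to restore the original row order.
    restored = [[to_text(s) for s in g[k][:]] for k in sorted(g.keys(), key=int)]
```

With 100M rows this would mean 100M tiny datasets, so it is probably only practical for a modest number of sublists.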
'Flattening' with a list join might work:
In [27]: list1 =[', '.join(x) for x in test_array]
In [28]: list1
Out[28]: ['a1, a2', 'b1', 'c1, c2, c3, c4']
In [30]: '\n'.join(list1)
Out[30]: 'a1, a2\nb1\nc1, c2, c3, c4'
The nested list can be recreated with a few split calls.
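A minimal sketch of that round trip, assuming no string contains the separators `', '` or a newline (otherwise the split would cut in the wrong places):

```python
test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# Flatten: one line per sublist, items separated by ', '.
flat = '\n'.join(', '.join(sublist) for sublist in test_array)

# Rebuild: split into lines first, then split each line into its items.
restored = [line.split(', ') for line in flat.split('\n')]
```

One caveat: an empty sublist does not survive this, because `''.split(', ')` returns `['']` rather than `[]`.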
Another thought - pickle to a string and save that.
From the h5py intro:
An HDF5 file is a container for two kinds of objects: datasets, which
are array-like collections of data, and groups, which are folder-like
containers that hold datasets and other groups. The most fundamental
thing to remember when using h5py is:
Groups work like dictionaries, and datasets work like NumPy arrays
pickle doesn't work:
In [32]: import pickle
In [33]: pickle.dumps(test_array)
Out[33]: b'\x80\x03]q\x00(]q\x01(X\x02\x00\x00\x00a1q\x02X\x02\x00\x00\x00a2q\x03e]q\x04X\x02\x00\x00\x00b1q\x05a]q\x06(X\x02\x00\x00\x00c1q\x07X\x02\x00\x00\x00c2q\x08X\x02\x00\x00\x00c3q\tX\x02\x00\x00\x00c4q\nee.'
In [34]: f.create_dataset('pickled', data=pickle.dumps(test_array), dtype=string_dt)
....
ValueError: VLEN strings do not support embedded NULLs
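The error comes from the NUL bytes in the pickle stream, not from pickling as such. One possible workaround, not part of the session above, is to store the pickled bytes as a plain `uint8` array instead of a string. The file name `pickled.h5` is made up for this sketch.

```python
import os
import pickle
import tempfile

import h5py
import numpy as np

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
path = os.path.join(tempfile.mkdtemp(), 'pickled.h5')  # throwaway file for the demo

payload = pickle.dumps(test_array)

with h5py.File(path, 'w') as f:
    # A uint8 array has no problem with embedded NUL bytes.
    f.create_dataset('pickled', data=np.frombuffer(payload, dtype=np.uint8))

with h5py.File(path, 'r') as f:
    # Read the byte array back and unpickle it.
    restored = pickle.loads(f['pickled'][:].tobytes())
```

The stored blob is opaque to other HDF5 tools and only readable from Python, which is a reason to prefer a text encoding where possible.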
In [35]: import json
In [36]: json.dumps(test_array)
Out[36]: '[["a1", "a2"], ["b1"], ["c1", "c2", "c3", "c4"]]'
In [37]: f.create_dataset('pickled', data=json.dumps(test_array), dtype=string_dt)
Out[37]: <HDF5 dataset "pickled": shape (), type "|O">
In [43]: json.loads(f['pickled'][()])
Out[43]: [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
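To keep row-wise indexing, the same idea can be applied per sublist: encode each sublist as its own JSON string and store them as a 1-D dataset of variable-length strings. This is a sketch under that assumption rather than something tested at the 100M-row scale; the file name `json_rows.h5` is made up.

```python
import json
import os
import tempfile

import h5py

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
string_dt = h5py.special_dtype(vlen=str)
path = os.path.join(tempfile.mkdtemp(), 'json_rows.h5')  # throwaway file for the demo

# One JSON string per sublist, so each row can be read on its own.
rows = [json.dumps(sublist) for sublist in test_array]

with h5py.File(path, 'w') as f:
    f.create_dataset('test_dataset', data=rows, dtype=string_dt)

with h5py.File(path, 'r') as f:
    dset = f['test_dataset']
    # json.loads accepts both str and bytes, so no explicit decode is needed.
    first = json.loads(dset[0])
    restored = [json.loads(row) for row in dset[:]]
```

Here `json.loads(f['test_dataset'][0])` gives `['a1', 'a2']`, which is close to the indexing asked for in the question, at the cost of one decode per row.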
Upvotes: 1