Peter Willemsen

Reputation: 379

Writing large amounts of numbers to a HDF5 file in Python

I currently have a dataset with about a million rows, each with around 10000 columns (the row length varies).

Now I want to write this data to an HDF5 file so I can use it later on. I got this to work, but it's incredibly slow: even 1000 values take up to a few minutes to be stored in the HDF5 file.

I've been looking everywhere, including SO and the h5py docs, but I can't find anything that describes my use case, yet I know it can be done.

Below is demo code showing what I'm doing right now:

import h5py
import numpy as np

# I am just using random values here.
# I know I can use h5py broadcasting and I have seen it used before.
# The issue is that I need to save around a million rows of about 10000 values each,
# so I can't keep the entire array in memory.
random_ints = np.random.random(size=(5000, 10000))

# See http://stackoverflow.com/a/36902906/3991199 for "libver='latest'"
with h5py.File('my.data.hdf5', "w", libver='latest') as f:
    X = f.create_dataset("X", (5000, 10000))
    for i1 in range(0, 5000):
        for i2 in range(0, 10000):
            # One HDF5 write per element: this is the slow part
            X[i1, i2] = random_ints[i1, i2]

        if i1 != 0 and i1 % 1000 == 0:
            print("Done %d values..." % i1)

The real data comes from a database; it is not a pre-generated NumPy array as in the code above.

If you run this code you can see it takes a long time before it prints out "Done 1000 values".

I'm on a laptop with 8 GB RAM, Ubuntu 16.04 LTS, an Intel Core M (which performs similarly to a Core i5), and an SSD. That should be enough to run faster than this.

I've read about broadcasting here: http://docs.h5py.org/en/latest/high/dataset.html

When I use it like this:

for i1 in range(0, 5000):
    X[i1, :] = random_ints[i1]

It is already an order of magnitude faster (done in a few seconds). But I don't know how to get that to work with a variable-length dataset (the rows have a variable number of columns). Some insight into how this should be done would be welcome, as I don't think I have a good grasp of the HDF5 concepts yet :) Thanks a lot!

Upvotes: 1

Views: 2154

Answers (1)

hpaulj

Reputation: 231385

Following http://docs.h5py.org/en/latest/special.html

and using an open h5 file f, I tried:

# each element of the dataset is a variable-length array of int32
dt = h5py.special_dtype(vlen=np.dtype('int32'))
vset = f.create_dataset('vset', (100,), dtype=dt)

Setting the elements one by one:

vset[0]=np.random.randint(0,100,1000)    # set just one element
for i in range(100):    # set all arrays of varying length
    vset[i]=np.random.randint(0,100,i)
vset[:]      # view the dataset

Or making an object array:

D=np.empty((100,),dtype=object)
for i in range(100):   # setting that in same way
    D[i]=np.random.randint(0,100,i)

vset[:]=D    # write it to the file

vset[:]=D[::-1]   # or write it in reverse order

A portion of the last write:

In [587]: vset[-10:]
Out[587]: 
array([array([52, 52, 46, 80,  5, 89,  6, 63, 21]),
       array([38, 95, 51, 35, 66, 44, 29, 26]),
       array([51, 96,  3, 64, 55, 31, 18]),
       array([85, 96, 30, 82, 33, 45]), array([28, 37, 61, 57, 88]),
       array([76, 65,  5, 29]), array([78, 29, 72]), array([77, 32]),
       array([5]), array([], dtype=int32)], dtype=object)
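
When the rows come from a database and do not all fit in memory, the object-array write can probably be done one batch at a time instead of all at once. The following is only a sketch under assumptions: fake_rows is a hypothetical stand-in for a database cursor, write_vlen_in_batches and batch_size are names made up for illustration, and the file is opened with the in-memory 'core' driver so that nothing is written to disk.

```python
import h5py
import numpy as np

def fake_rows(n_rows, seed=0):
    """Hypothetical stand-in for a database cursor: yields int32 rows of varying length."""
    rng = np.random.RandomState(seed)
    for i in range(n_rows):
        yield rng.randint(0, 100, size=i % 7).astype('int32')

def write_vlen_in_batches(f, name, rows, n_rows, batch_size=1000):
    """Fill a variable-length dataset from an iterator, one slice write per batch."""
    dt = h5py.special_dtype(vlen=np.dtype('int32'))
    vset = f.create_dataset(name, (n_rows,), dtype=dt)
    batch = np.empty((batch_size,), dtype=object)  # reusable object-array buffer
    start = 0   # index in the dataset where the current batch begins
    fill = 0    # number of rows currently held in the buffer
    for row in rows:
        batch[fill] = row
        fill += 1
        if fill == batch_size:
            vset[start:start + fill] = batch
            start += fill
            fill = 0
    if fill:  # write the last, partially filled batch
        vset[start:start + fill] = batch[:fill]
    return vset

# In-memory file (driver='core', backing_store=False), so no file is created on disk
with h5py.File('vlen_demo.h5', 'w', driver='core', backing_store=False) as f:
    vset = write_vlen_in_batches(f, 'vset', fake_rows(25), 25, batch_size=10)
    lengths = [len(vset[i]) for i in range(25)]
```

Only batch_size rows are held in memory at any time, and each batch costs one slice assignment rather than one write per row. The total number of rows has to be known up front here, because the dataset is created with a fixed shape.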

I can view portions of an element with:

In [593]: vset[3][:10]
Out[593]: array([86, 26,  2, 79, 90, 67, 66,  5, 63, 68])

but I can't index it as a 2-D array: vset[3,:10] does not work. It's an array of arrays.
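
If 2-D indexing is needed after reading, one option (not something h5py does for you) is to pad the rows into a regular array in memory. A minimal sketch, where pad_rows and the fill value of -1 are assumptions chosen for illustration:

```python
import numpy as np

def pad_rows(rows, fill=-1):
    """Pack 1-D arrays of differing lengths into one 2-D array, padding with `fill`."""
    rows = [np.asarray(r) for r in rows]
    width = max((len(r) for r in rows), default=0)
    dtype = rows[0].dtype if rows else np.int32
    out = np.full((len(rows), width), fill, dtype=dtype)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r   # copy the row, leave the rest as padding
    return out

padded = pad_rows([
    np.array([5, 6, 7], dtype='int32'),
    np.array([8], dtype='int32'),
    np.array([], dtype='int32'),
])
# padded is [[5, 6, 7], [8, -1, -1], [-1, -1, -1]]
```

With an open dataset this would be called on a slice that has already been read, for example pad_rows(vset[-10:]), after which padded[3, :10] style indexing works. The fill value has to be one that cannot occur in the real data.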

Upvotes: 1
