Hoeze
Hoeze

Reputation: 716

HDF5 for data exchange between python and R

I'm working on a project which needs to save and load multiple:

Now I'd like to store all my data in a single file (or transparent data storage), but I'm not sure how to store tables correctly.
How should I save the table's axis labels, in a way keeping the data programming language - independent?

Upvotes: 4

Views: 971

Answers (1)

jpp
jpp

Reputation: 164623

Your question is broad, but I will try and dispel some myths to get your started. I only have experience with Python, so my examples will only relate to using HDF5 with Python.

Pandas or PyTables can access HDF5 files, but they do not allow to store plain NumPy arrays I think.

You are correct in that PyTables doesn't let you save a plain NumPy array without any additional overhead. But you don't need to use PyTables. h5py offers a NumPy-like interface to storing and accessing arrays in / from HDF5 files.

Store a NumPy array

import h5py, numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance

dset = f.create_dataset('array', shape=(1000, 1000), data=arr, chunks=(100, 100)
                        compression='gzip', compression_opts=9)

There are compression and chunking options which you can explore further to optimise read/write performance and compression ratios, according to your requirements. Note, however, that gzip is one of the few compression filters which ship with all HDF5 installations.

Store axis labels as attributes

Attributes are similar to datasets and allow you to store a wide range of data, including scalars or arrays.

dset.attrs['Description'] = 'Some text snippet'
dset.attrs['X-Labels'] = np.arange(1000)
dset.attrs['Y-Labels'] = np.arange(1000)

Internally, data is not stored as NumPy arrays, but in data-type sensitive contiguous memory blocks according to the HDF5 specification. As such, you will be able to read these files from any HDF5 API.

It is worth noting that there are specific requirements to ensure strings are transportable, see Strings in HDF5 from the h5py docs for more details.

Upvotes: 3

Related Questions