171227register

Reputation: 21

How to read data in HDF5 format file partially when the data is too large to read fully

I am engaged in analysing HDF5 format data for scientific research purposes. I'm using Python's h5py library.

Now, the HDF5 file I want to read is very large. Its file size is about 20 GB, and the main part of its data is a 400000*10000 float matrix. I tried to read the data in one go, but my development environment, Spyder, was killed because it ran out of memory. Is there any way to read the file partially and avoid this problem?

Upvotes: 2

Views: 2125

Answers (2)

Anonymous

Reputation: 11

Use pd.read_hdf with the columns argument. See the example below:

import numpy as np
import pandas as pd
from contexttimer import Timer


def create_sample_df():
    with Timer() as t:
        df = pd.DataFrame(np.random.rand(100000, 5000))
        df.to_hdf('file.h5', key='df', format='table')
    print('create_sample_df: %.2fs' % t.elapsed)


def read_full_df():
    """ read the whole dataframe at once (too slow when the data is large) """
    with Timer() as t:
        df = pd.read_hdf('file.h5')
    print('read_full_df: %.2fs' % t.elapsed)


def read_df_with_start_stop():
    """ read only the first few rows to get a quick look at all columns """
    with Timer() as t:
        df = pd.read_hdf('file.h5', start=0, stop=5)
    print('read_df_with_start_stop: %.2fs' % t.elapsed)


def read_df_with_columns():
    """ read only the necessary columns of the dataframe (hdf5) """
    with Timer() as t:
        df = pd.read_hdf('file.h5', columns=list(range(4)))
    print('read_df_with_columns: %.2fs' % t.elapsed)


if __name__ == '__main__':
    create_sample_df()
    read_full_df()
    read_df_with_start_stop()
    read_df_with_columns()

    # outputs:
    # create_sample_df: 51.25s
    # read_full_df: 5.21s
    # read_df_with_start_stop: 0.03s
    # read_df_with_columns: 4.44s

read_df_with_columns only reduces memory usage; it does not necessarily improve speed. And this works only if the HDF5 file was saved in table format (otherwise the columns argument cannot be applied).
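To illustrate that last point, here is a minimal sketch (file names 'fixed.h5' and 'table.h5' are made up for the demo) contrasting the default 'fixed' format, which must be read in its entirety, with 'table' format, which supports column selection:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 10))

# 'fixed' format (the default): fast to write, but no partial column reads
df.to_hdf('fixed.h5', key='df', format='fixed')
try:
    pd.read_hdf('fixed.h5', key='df', columns=[0, 1])
except (TypeError, ValueError):
    print('fixed format: the columns argument is not supported')

# 'table' format: slower to write, but allows reading a column subset
df.to_hdf('table.h5', key='df', format='table')
subset = pd.read_hdf('table.h5', key='df', columns=[0, 1])
print(subset.shape)  # (100, 2)
```

So if partial reads are the goal, the file must be written with format='table' in the first place; an existing fixed-format file would have to be rewritten.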

Upvotes: 1

James Tocknell

Reputation: 543

You can slice h5py datasets like NumPy arrays, so you can work on a number of subsets instead of the whole dataset (e.g. four 100000*10000 subsets).
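A minimal sketch of this chunked approach with h5py, assuming the dataset is stored under the (hypothetical) name 'data'; a small sample file stands in for the 20 GB one:

```python
import h5py
import numpy as np

# create a small sample file (stand-in for the large file;
# the dataset name 'data' is an assumption)
with h5py.File('big.h5', 'w') as f:
    f.create_dataset('data', data=np.random.rand(400, 100))

with h5py.File('big.h5', 'r') as f:
    dset = f['data']      # no data is loaded into memory yet
    n_rows = dset.shape[0]
    chunk = 100           # rows per slice; tune this to fit your memory
    totals = []
    for start in range(0, n_rows, chunk):
        # only this slice is read from disk into memory
        block = dset[start:start + chunk, :]
        totals.append(block.sum())
    print(sum(totals))
```

Each slice is read from disk only when indexed, so peak memory is bounded by the chunk size rather than the full dataset.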

Upvotes: 0
