Reputation: 71
I currently run a simulation several times and want to save the results of these simulations so that they can be used for visualizations.
The simulation is run 100 times, and each run generates about 1 million data points (i.e. one value for each of 1 million episodes), which I now want to store efficiently. The goal is then to compute, for each episode, the average of its value across all 100 simulations.
My main file looks like this:
import h5py

# Defining the test simulation environment
def test_simulation():
    env = environment(
        periods = 1000000,
        parameter_x = ...,
        parameter_y = ...,
    )

    # Defining the simulation
    env.simulation()

    # Save simulation data
    hf = h5py.File('runs/simulation_runs.h5', 'a')
    hf.create_dataset('data', data=env.value_history, compression='gzip', chunks=True)
    hf.close()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The value_history is generated within simulation(), i.e. the values are continuously appended to an empty list according to:
def simulation(self):
    for episode in range(periods):
        value = doSomething()
        self.value_history.append(value)
Now I get the following error message when going to the next simulation:
ValueError: Unable to create dataset (name already exists)
I am aware that the current code keeps trying to create a dataset with the same name and fails because that dataset already exists. Now I am looking to reopen the file created in the first simulation, append the data from the next simulation, and save it again.
Upvotes: 1
Views: 321
Reputation: 8016
The example below shows how to pull all these ideas together. It creates 2 files:

1. simulation_runs1.h5: a single resizable dataset, created with the maxshape parameter on the first loop and extended with dataset.resize() on subsequent loops.
2. simulation_runs2.h5: a unique dataset for each simulation run.

I created a simple 100x100 NumPy array of randoms for the "simulation data" and ran the simulation 10 times. Both sizes are variables, so you can increase them to larger values to determine which method is better (faster) for your data. You may also discover memory limitations saving 1M data points for 1M time periods.
Note 1: If you can't save all the data in system memory, you can incrementally save simulation results to the H5 file. It's just a little more complicated.
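As a rough sketch of what that incremental approach could look like (assuming you know periods up front and can produce the values in blocks; the dataset name run_000, the block size, and the random values below are just placeholders for your real simulation output):

import h5py
import numpy as np

periods = 1_000_000
block_size = 10_000

with h5py.File('runs/simulation_runs1.h5', 'a') as hf:
    # Pre-allocate one dataset per run, then fill it block by block
    # instead of holding the whole value_history in memory.
    dset = hf.create_dataset('run_000', shape=(periods,),
                             compression='gzip', chunks=True)
    for start in range(0, periods, block_size):
        block = np.random.random(block_size)   # stand-in for real episode values
        dset[start:start + block_size] = block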
Note 2: I added a mode variable to control whether a new file is created for the first simulation (i==0) or the existing file is opened in append mode for subsequent simulations.
import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    periods = 100
    times = 100

    # Define the simulation with some random data
    val_hist = np.random.random(periods*times).reshape(periods, times)
    a0, a1 = val_hist.shape[0], val_hist.shape[1]

    if i == 0:
        mode = 'w'
    else:
        mode = 'a'

    # Save simulation data (resize dataset)
    with h5py.File('runs/simulation_runs1.h5', mode) as hf:
        if 'data' not in list(hf.keys()):
            print('create new dataset')
            hf.create_dataset('data', shape=(1, a0, a1), maxshape=(None, a0, a1), data=val_hist,
                              compression='gzip', chunks=True)
        else:
            print('resize existing dataset')
            d0 = hf['data'].shape[0]
            hf['data'].resize((d0+1, a0, a1))
            hf['data'][d0:d0+1, :, :] = val_hist

    # Save simulation data (unique datasets)
    with h5py.File('runs/simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)

# Run the simulation 10 times
for i in range(10):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
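Once all the runs are saved, the per-episode average across simulations (the goal mentioned in the question) can be computed by reading the stacked dataset back and averaging over the first axis. A minimal sketch that reads simulation_runs1.h5 as written above:

import h5py

with h5py.File('runs/simulation_runs1.h5', 'r') as hf:
    data = hf['data'][:]              # all runs; shape (n_simulations, periods, times)
    episode_mean = data.mean(axis=0)  # average each value across all simulations
    print(episode_mean.shape)         # (periods, times)

If the full dataset is too large for memory, you can read and average it in slices instead of loading it all at once.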
Upvotes: 1