Reputation: 71
I currently run a simulation several times and want to save the results of these simulations so that they can be used for visualizations.
The simulation is run 100 times, and each run generates about 1 million data points (i.e. one value for each of 1 million episodes), which I now want to store efficiently. The goal is then to compute, for each episode, the average of its value across all 100 simulations.
My main file looks like this:
import h5py

# Defining the test simulation environment
def test_simulation():
    env = environment(
        periods = 1000000,
        parameter_x = ...,
        parameter_y = ...,
    )

    # Defining the simulation
    env.simulation()

    # Save simulation data
    hf = h5py.File('runs/simulation_runs.h5', 'a')
    hf.create_dataset('data', data=env.value_history, compression='gzip', chunks=True)
    hf.close()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The value_history is generated within simulation(), i.e. the values are continuously appended to an empty list according to:
def simulation(self):
    for episode in range(periods):
        value = doSomething()
        self.value_history.append(value)
Now I get the following error message when going to the next simulation:
ValueError: Unable to create dataset (name already exists)
I am aware that the current code keeps trying to create a dataset with the same name and fails because that dataset already exists. Now I am looking to reopen the file created in the first simulation, append the data from the next simulation, and save it again.
Upvotes: 1
Views: 321
Reputation: 8016
The example below shows how to pull all these ideas together. It creates 2 files:

1. simulation_runs1.h5: a single resizable dataset, created with the maxshape parameter on the first loop and extended with dataset.resize() on subsequent loops.
2. simulation_runs2.h5: a unique dataset for each simulation run.

I created a simple 100x100 NumPy array of randoms for the "simulation data" and ran the simulation 10 times. Both sizes are variables, so you can increase them to larger values to determine which method is better (faster) for your data. You may also discover memory limitations saving 1M data points for 1M time periods.
Note 1: If you can't save all the data in system memory, you can incrementally save simulation results to the H5 file. It's just a little more complicated.
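As a rough sketch of what that incremental approach could look like (assuming you know periods up front and can produce the values in blocks; the dataset name run_000, the block size, and the random values below are just placeholders for your real simulation output):

import h5py
import numpy as np

periods = 1_000_000
block_size = 10_000

with h5py.File('runs/simulation_runs1.h5', 'a') as hf:
    # Pre-allocate one dataset per run, then fill it block by block
    # instead of holding the whole value_history in memory.
    dset = hf.create_dataset('run_000', shape=(periods,),
                             compression='gzip', chunks=True)
    for start in range(0, periods, block_size):
        block = np.random.random(block_size)   # stand-in for real episode values
        dset[start:start + block_size] = block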
Note 2: I added a mode variable to control whether a new file is created for the first simulation (i==0) or the existing file is opened in append mode for subsequent simulations.
import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    periods = 100
    times = 100

    # Define the simulation with some random data
    val_hist = np.random.random(periods*times).reshape(periods, times)
    a0, a1 = val_hist.shape[0], val_hist.shape[1]

    if i == 0:
        mode = 'w'
    else:
        mode = 'a'

    # Save simulation data (resize dataset)
    with h5py.File('runs/simulation_runs1.h5', mode) as hf:
        if 'data' not in list(hf.keys()):
            print('create new dataset')
            hf.create_dataset('data', shape=(1, a0, a1), maxshape=(None, a0, a1), data=val_hist,
                              compression='gzip', chunks=True)
        else:
            print('resize existing dataset')
            d0 = hf['data'].shape[0]
            hf['data'].resize((d0+1, a0, a1))
            hf['data'][d0:d0+1, :, :] = val_hist

    # Save simulation data (unique datasets)
    with h5py.File('runs/simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)

# Run the simulation 10 times
for i in range(10):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
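Once all the runs are saved, the per-episode average across simulations (the goal mentioned in the question) can be computed by reading the stacked dataset back and averaging over the first axis. A minimal sketch that reads simulation_runs1.h5 as written above:

import h5py

with h5py.File('runs/simulation_runs1.h5', 'r') as hf:
    data = hf['data'][:]              # all runs; shape (n_simulations, periods, times)
    episode_mean = data.mean(axis=0)  # average each value across all simulations
    print(episode_mean.shape)         # (periods, times)

If the full dataset is too large for memory, you can read and average it in slices instead of loading it all at once.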
Upvotes: 1