Reputation: 1422
I'd like to manipulate a set of data in an HDF5 file and be able to decide, just before closing the file, whether to discard all changes or keep them.
From the documentation on file drivers:
HDF5 ships with a variety of different low-level drivers, which map the logical HDF5 address space to different storage mechanisms. You can specify which driver you want to use when the file is opened:
f = h5py.File('myfile.hdf5', driver=<driver name>, <driver_kwds>)
For example, the HDF5 “core” driver can be used to create a purely in-memory HDF5 file, optionally written out to disk when it is closed. Here’s a list of supported drivers and their options:
‘core’:
Store and manipulate the data in memory, and optionally write it back out when the file is closed. Using this with an existing file and a reading mode will read the entire file into memory. Keywords:
backing_store:
If True (default), save changes to the real file at the specified path on close() or flush(). If False, any changes are discarded when the file is closed.
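For illustration, the quoted backing_store=False behaviour can be exercised like this (a minimal sketch using a throwaway demo.h5):

```python
import h5py

# Create a small file on disk
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("d", data=[1, 2, 3])

# Reopen it entirely in memory; backing_store=False means all
# changes are discarded when the file is closed
with h5py.File("demo.h5", "r+", driver="core", backing_store=False) as f:
    f["d"][...] = [9, 9, 9]

# The on-disk data is unchanged
with h5py.File("demo.h5", "r") as f:
    print(f["d"][...])  # [1 2 3]
```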
With the core driver and backing_store=False, changes are always discarded, regardless of whether I call flush() (as expected). With the default driver, on the other hand, changes are always persisted to the file on closure.
Based on the above, I've created a very simple example:
from h5py import File
# Create a dummy file from scratch
f = File('test.h5', 'w')
f.create_dataset("test_dataset", data=[1, 2, 3])
f.close()
# Open and modify the data
f = File('test.h5', 'r+') # In this case changes are always persisted
# f = File('test.h5', 'r+', driver='core', backing_store=False) # In this case changes are always discarded
ds = f["test_dataset"]
ds[...] = [3, 4, 5]
# f.flush() # Useless in this case, obviously
f.close() # Here changes should be discarded
# Read now `test_dataset`
f = File('test.h5', 'r')
print(f['test_dataset'][...])
f.close()
Is there a way to decide just before closing the file whether to save changes or not?
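One pattern that does allow a last-moment decision (assuming you can afford keeping the whole file in memory) is to edit an in-memory copy and only write it back on demand; a minimal sketch:

```python
import io
import h5py

# Create the dummy file as in the question
with h5py.File("test.h5", "w") as f:
    f.create_dataset("test_dataset", data=[1, 2, 3])

# Load the whole file into an in-memory buffer
with open("test.h5", "rb") as fin:
    buf = io.BytesIO(fin.read())

# Edit the in-memory copy; the file on disk stays untouched
with h5py.File(buf, "r+") as f:
    f["test_dataset"][...] = [3, 4, 5]

keep_changes = False  # decide here, just before "closing"

if keep_changes:
    with open("test.h5", "wb") as fout:
        fout.write(buf.getbuffer())

with h5py.File("test.h5", "r") as f:
    print(f["test_dataset"][...])  # [1 2 3]: changes were discarded
```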
The undo
mechanism seems to work ONLY with newly created datasets, NOT with edits to pre-existing ones:
import tables as t
import numpy as np
# Create the file
with t.open_file(r'test.h5', 'w') as fr:
    fr.create_carray('/', 'TestArray', obj=np.array([1, 2, 3], dtype='uint8'))
with t.open_file('test.h5', 'r+') as fr:
    # This will remove any previously created marks
    if fr.is_undo_enabled():
        fr.disable_undo()
    fr.enable_undo()  # Re-enable undo
    fr.mark('MyMark')
    # Create new array from scratch, and it will be discarded
    new_arr = fr.create_carray('/', 'NewCreatedArray', obj=np.array([10, 11, 12]))
    # Modify a pre-existing array! --> THIS WILL NOT BE DISCARDED
    arr = fr.root.TestArray
    arr[...] = np.array([3, 4, 5])
    # Move back to when I opened the file
    fr.undo('MyMark')
with t.open_file('test.h5', 'r+') as fr:
    print(fr)
    print('Test Array: ', fr.root.TestArray[:])
Result is:
test.h5 (File) ''
Last modif.: '2023-05-18T07:26:13+00:00'
Object Tree:
/ (RootGroup) ''
/TestArray (CArray(3,)) ''
Test Array: [3 4 5]
Upvotes: 1
Views: 187
Reputation: 8091
As noted in my comments, PyTables has the ability to mark, undo, and redo the state of the database. I created 2 examples to show different processes. Print statements are provided to show what happens in each example.
Note: enable_undo
is persistent in the file. So, you can reopen the file with mode='r+'
, call h5f.undo("test2")
, and changes back to mark "test2"
will be undone.
Example 1
import tables as tb
# Create a dummy file from scratch
with tb.File('test_tb.h5', 'w') as h5f:
    h5f.create_array("/", "test0", obj=[1, 2, 3])
# Open, enable undo and modify the data
# Set mark, add data, then undo; changes are discarded
with tb.File('test_tb.h5', 'r+') as h5f:
    h5f.enable_undo()
    h5f.mark("test1")  # mark name optional
    h5f.create_array("/", "test1", obj=[11, 12, 13])
    print("\n*** Before Undo ***")
    for node in h5f.iter_nodes('/'):
        print(node._v_pathname)
    h5f.undo("test1")
    print("\n*** After Undo ***")
    for node in h5f.iter_nodes('/'):
        print(node._v_pathname)
# Open and modify the data
# Set mark, add data, do not undo; changes are saved
with tb.File('test_tb.h5', 'r+') as h5f:
    h5f.mark("test2")  # mark name optional
    h5f.create_array("/", "test2", obj=[21, 22, 23])
with tb.File('test_tb.h5', 'r') as h5f:
    print("\n*** Final ***")
    for node in h5f.iter_nodes('/'):
        print(node._v_pathname)
        print(node[:])
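To demonstrate the persistence noted above, here is a minimal sketch (using a hypothetical file undo_demo.h5) where the mark is set in one session and the undo happens in a later one:

```python
import tables as tb

# First session: enable undo, set a mark, create an array
with tb.File("undo_demo.h5", "w") as h5f:
    h5f.enable_undo()
    h5f.mark("test2")
    h5f.create_array("/", "test2", obj=[21, 22, 23])

# Second session: the undo log and the mark survived the close,
# so we can still roll back to the mark set earlier
with tb.File("undo_demo.h5", "r+") as h5f:
    print("/test2" in h5f)  # True: the array is there
    h5f.undo("test2")
    print("/test2" in h5f)  # False: the creation was rolled back
```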
Example 2
import tables as t
import numpy as np
# Create the file
with t.open_file(r'test_tb2.h5', 'w') as fr:
    fr.create_carray('/', 'TestArray', obj=np.array([1, 2, 3], dtype='uint8'))
    print(fr)
    print('Test Array: ', fr.root.TestArray[:], '\n')
with t.open_file('test_tb2.h5', 'r+') as fr:
    # This will remove any previously created marks
    if fr.is_undo_enabled():
        fr.disable_undo()
    fr.enable_undo()  # Re-enable undo
    fr.mark('MyMark')
    # Rename pre-existing array! --> Undo works on this
    fr.rename_node('/', 'TestArray_bck', name='TestArray')
    # Create a new array dataset using the previous name
    fr.create_carray('/', 'TestArray', obj=np.array([3, 4, 5], dtype='uint8'))
    print(fr)
    print('Test Array: ', fr.root.TestArray[:], '\n')
    # Move back to when I opened the file
    fr.undo('MyMark')
    print(fr)
    print('Test Array: ', fr.root.TestArray[:], '\n')
Upvotes: 2
Reputation: 13463
I'm not convinced that this is the best approach, as it will drive up your memory requirements. Anyway, one thing you can do is work on a BytesIO
buffer. This mimics the "core" driver but gives you full control over when to write the buffer back to the file.
import io
import h5py
import numpy as np

with open("foo.h5", "rb") as infile:
    raw = infile.read()
buf = io.BytesIO(raw)
del raw
with h5py.File(buf, "a") as memfile:
    memfile.create_dataset("bar", data=np.random.random((100, 100)))
with open("foo.h5", "wb") as outfile:
    outfile.write(buf.getbuffer())
Upvotes: 0