How to choose whether to persist or discard changes in HDF5 file before closing

I'd like to manipulate a set of data in an HDF5 file and be able to decide, before closing the file, whether to keep or discard all changes. From the h5py documentation on file drivers:

HDF5 ships with a variety of different low-level drivers, which map the logical HDF5 address space to different storage mechanisms. You can specify which driver you want to use when the file is opened:
f = h5py.File('myfile.hdf5', driver=<driver name>, <driver_kwds>)
For example, the HDF5 “core” driver can be used to create a purely in-memory HDF5 file, optionally written out to disk when it is closed. Here’s a list of supported drivers and their options:

‘core’:

Store and manipulate the data in memory, and optionally write it back out when the file is closed. Using this with an existing file and a reading mode will read the entire file into memory. Keywords:

backing_store:

If True (default), save changes to the real file at the specified path on close() or flush(). If False, any changes are discarded when the file is closed.

With backing_store=False, changes are always discarded (as expected), regardless of whether I call flush() or not. With the default driver, on the other hand, changes are always persisted to the file on close.

Based on the above, I created a very simple example:

from h5py import File



# Create a dummy file from scratch
f = File('test.h5', 'w')
f.create_dataset("test_dataset", data=[1, 2, 3])
f.close()



# Open and modify the data
f = File('test.h5', 'r+')  # In this case changes are always persisted
# f = File('test.h5', 'r+', driver='core', backing_store=False)  # In this case changes are always discarded
ds = f["test_dataset"]
ds[...] = [3, 4, 5]
# f.flush()  # Useless in this case, obviously
f.close()  # Here changes should be discarded



# Read now `test_dataset`
f = File('test.h5', 'r')
print(f['test_dataset'][...])
f.close()

Is there a way to decide just before closing the file whether to save changes or not?

EDIT 1: The PyTables `undo` mechanism seems to work ONLY with newly created datasets, NOT with edits to pre-existing ones

import tables as t
import numpy as np

# Create the file
with t.open_file(r'test.h5', 'w') as fr:
    fr.create_carray('/', 'TestArray', obj=np.array([1, 2, 3], dtype='uint8'))

with t.open_file('test.h5', 'r+') as fr:
    # This will remove any previously created marks
    if fr.is_undo_enabled():
        fr.disable_undo()
    fr.enable_undo()  # Re-enable undo
    fr.mark('MyMark')

    # Create new array from scratch, and it will be discarded
    new_arr = fr.create_carray('/', 'NewCreatedArray', obj=np.array([10, 11, 12]))

    # Modify a pre-existing array! --> THIS WILL NOT BE DISCARDED
    arr = fr.root.TestArray
    arr[...] = np.array([3, 4, 5])

    # Move back to when I opened the file
    fr.undo('MyMark')

with t.open_file('test.h5', 'r+') as fr:
    print(fr)
    print('Test Array: ', fr.root.TestArray[:])

Result is:

test.h5 (File) ''
Last modif.: '2023-05-18T07:26:13+00:00'
Object Tree: 
/ (RootGroup) ''
/TestArray (CArray(3,)) ''

Test Array:  [3 4 5]

Upvotes: 1

Answers (2)

kcw78

Reputation: 8091

As noted in my comments, PyTables has the ability to mark, undo, and redo the database status. I created 2 examples to show different processes. Print statements are provided to show what happens in each example.

The 1st example shows how to undo a newly created dataset. It creates a file with an array dataset ('test0'). Next a mark is set, and a 2nd array dataset ('test1') is added, followed by an undo. Then a 2nd mark is set and a 3rd dataset ('test2') is created (but undo is not called).
The 2nd example is a follow-up to @Buzz's comments on May 18, 2023. It shows how to modify the values of an existing dataset (as an extension of the code added to the question). It creates a file with an array dataset ('TestArray'). Next a mark is set, and the 1st array dataset is renamed ('TestArray_bck'). Then a 2nd dataset is created using the same name as the 1st, but with new data values. After undo is called, the file reverts to its status before the mark.

Note: enable_undo is persistent in the file. So, you can reopen the file with mode='r+', enter h5f.undo("test2"), and changes back to mark "test2" will be undone.

Example 1

import tables as tb

# Create a dummy file from scratch
with tb.File('test_tb.h5', 'w') as h5f:
    h5f.create_array("/","test0", obj=[1, 2, 3])

# Open, enable undo and modify the data 
# Set mark, add data, then undo; changes are discarded
with tb.File('test_tb.h5', 'r+') as h5f:
    h5f.enable_undo()
    h5f.mark("test1") # mark name optional
    h5f.create_array("/","test1", obj=[11, 12, 13])
    print("\n*** Before Undo ***")
    for node in h5f.iter_nodes('/'):
        print(node._v_pathname)
    h5f.undo("test1")
    print("\n*** After Undo ***")
    for node in h5f.iter_nodes('/'):
        print(node._v_pathname)
    
# Open and modify the data
# Set mark, add data, do not undo; changes are saved
with tb.File('test_tb.h5', 'r+') as h5f:
    h5f.mark("test2") # mark name optional
    h5f.create_array("/","test2", obj=[21, 22, 23])

with tb.File('test_tb.h5', 'r') as h5f:
    print("\n*** Final ***")
    for node in h5f.iter_nodes('/'):
        print(node._v_pathname)
        print(node[:])

Example 2

import tables as t
import numpy as np

# Create the file
with t.open_file(r'test_tb2.h5', 'w') as fr:
    fr.create_carray('/', 'TestArray', obj=np.array([1, 2, 3], dtype='uint8'))
    print(fr)
    print('Test Array: ', fr.root.TestArray[:],'\n')
    
with t.open_file('test_tb2.h5', 'r+') as fr:
    # This will remove any previously created marks
    if fr.is_undo_enabled():
        fr.disable_undo()
    fr.enable_undo()  # Re-enable undo
    fr.mark('MyMark')

    # Rename pre-existing array! --> Undo works on this
    fr.rename_node('/', 'TestArray_bck', name='TestArray')
    # Create a new array dataset using the previous name
    fr.create_carray('/', 'TestArray', obj=np.array([3, 4, 5], dtype='uint8'))

    print(fr)
    print('Test Array: ', fr.root.TestArray[:],'\n')
    
    # Move back to when I opened the file
    fr.undo('MyMark')
    print(fr)
    print('Test Array: ', fr.root.TestArray[:],'\n')

Upvotes: 2

Homer512

Reputation: 13463

I'm not convinced that this is the best approach, as it will drive up your memory requirements. Possible alternatives are:

working on a copy of the file, preferably on a file system with copy-on-write support
recording changes separately (possibly with a temporary file as buffer), then applying them to the file at the end
switching to a different storage that supports transactions
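
The first alternative, working on a copy, might be sketched roughly as follows. This is only an illustration under assumptions: `working_copy` and the `decision` dictionary are hypothetical names invented here, not part of h5py or PyTables, and the helper treats the HDF5 file as an opaque file on disk.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def working_copy(path):
    """Yield (tmp_path, decision) for editing a temporary copy of `path`.

    Hypothetical helper: the copy replaces the original only if the caller
    sets decision["commit"] = True before leaving the with-block.
    """
    # Create the copy next to the original so os.replace stays on one file system
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=directory)
    os.close(fd)
    shutil.copy2(path, tmp_path)
    decision = {"commit": False}
    try:
        yield tmp_path, decision
        if decision["commit"]:
            # Swap the edited copy in place of the original
            os.replace(tmp_path, path)
    finally:
        # Discard path (or an exception): remove the leftover copy
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Inside the with-block you would open `tmp_path` with h5py.File(tmp_path, 'r+'), edit it, close it, and then set decision["commit"] depending on whether the changes should be kept. The HDF5 file has to be closed before the block ends, otherwise the replace may fail or pick up unflushed data.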

Anyway, one thing you can do is work on a BytesIO buffer. This mimics the "core" driver but gives you full control over when to write the buffer back to the file.

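import io
import h5py
import numpy as np

# Load the whole file into an in-memory buffer
with open("foo.h5", "rb") as infile:
    raw = infile.read()
buf = io.BytesIO(raw)
del raw
# Modify the in-memory copy; the file on disk is untouched so far
with h5py.File(buf, "a") as memfile:
    memfile.create_dataset("bar", data=np.random.random((100, 100)))
# Write the buffer back to disk; skip this step to discard the changes
with open("foo.h5", "wb") as outfile:
    outfile.write(buf.getbuffer())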
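
To make the keep-or-discard decision explicit, the snippet above could be wrapped in a small helper. This is a minimal sketch, not a tested recipe: `edit_in_memory` and its `modify` callback are hypothetical names introduced here, and it assumes the whole file fits comfortably in memory.

```python
import io

import h5py


def edit_in_memory(path, modify):
    """Edit an in-memory copy of the HDF5 file at `path`.

    `modify` (hypothetical callback) receives the open h5py.File and returns
    True to persist the changes or False to discard them.
    """
    # Read the entire file into a BytesIO buffer
    with open(path, "rb") as infile:
        buf = io.BytesIO(infile.read())

    # All edits go to the buffer; the decision is made just before closing
    with h5py.File(buf, "r+") as memfile:
        keep = bool(modify(memfile))

    # Only overwrite the file on disk if the caller asked to keep the changes
    if keep:
        with open(path, "wb") as outfile:
            outfile.write(buf.getbuffer())
    return keep
```

With the file from the question, a callback that assigns [3, 4, 5] to f["test_dataset"] and returns False should leave test.h5 unchanged, while returning True should write the new values back.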

Upvotes: 0
