Tom de Geus

Reputation: 5975

Do I need to manually close a HDF5-file?

Do I understand correctly that HDF5-files should be manually closed like this:

import h5py

file = h5py.File('test.h5', 'r')

...

file.close()

From the documentation: "HDF5 files work generally like standard Python file objects. They support standard modes like r/w/a, and should be closed when they are no longer in use."

But I wonder: will garbage collection invoke file.close() when the script terminates, or when the variable file is overwritten?

Upvotes: 2

Views: 3543

Answers (1)

Tim Child

Reputation: 468

This was answered in the comments a long time ago by @kcw78, but I thought I might as well write it up as a quick answer for anyone else reaching this.

As @kcw78 says, you should explicitly close files when you are done with them by calling file.close(). In my experience, h5py files are usually closed properly anyway when the script terminates, but occasionally a file would end up corrupt (although I'm not sure whether that ever happens when the file is opened in 'r' mode only). Better not to leave it to chance!
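If you do go the explicit file.close() route, a try/finally block makes sure the close still happens when something in between raises. This is only a minimal sketch: the temporary path is a hypothetical stand-in for your own file, and the final check relies on an h5py.File object evaluating as False once it has been closed.

```python
import os
import tempfile

import h5py

# Hypothetical scratch location, used only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test.h5")

f = h5py.File(path, "w")
try:
    f["data"] = [1, 2, 3]
finally:
    # Runs even if the write above raises, so the handle is never left open.
    f.close()

# A closed h5py.File is falsy, which gives a cheap "is it still open?" check.
is_open_after_close = bool(f)
print(is_open_after_close)
```

This is essentially what the with statement does for you, which is why the context manager is usually the tidier choice.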

As @kcw78 also suggests, using a context manager is a good way to go if you want to be safe. In either case, you need to be careful to actually extract the data you want before letting the file close.

e.g.

import h5py

with h5py.File('test.h5', 'w') as f:
    f['data'] = [1,2,3]

# Letting the file close and reopening in read only mode for example purposes

with h5py.File('test.h5', 'r') as f:
    dataset = f.get('data')  # get the h5py.Dataset
    data = dataset[:]  # Copy the array into memory 
    print(dataset.shape, data.shape)  # appear to behave the same
    print(dataset[0], data[0])  # appear to behave the same

print(data[0], data.shape)  # Works same as above
print(dataset[0], dataset.shape)  # Raises ValueError: Not a dataset

dataset[0] raises an error here because dataset is an instance of h5py.Dataset that is tied to f, so it was closed at the same moment f was closed. data, on the other hand, is just a numpy array holding a copy of the dataset's values (i.e. none of its additional attributes), so it stays usable after the file is closed.
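If you need the values (and perhaps the attributes) after the file is closed, one option is to copy both out inside the with block. The sketch below assumes a hypothetical helper, read_dataset, plus a made-up "units" attribute and a temporary path; none of these come from the question, they are only there to make the example runnable.

```python
import os
import tempfile

import h5py
import numpy as np


def read_dataset(path, name):
    """Hypothetical helper: copy one dataset and its attributes into memory."""
    with h5py.File(path, "r") as f:
        dataset = f[name]
        values = dataset[...]        # full read into a numpy array
        attrs = dict(dataset.attrs)  # plain dict, independent of the file
    # The file is closed here, but values and attrs are ordinary in-memory copies.
    return values, attrs


# Hypothetical scratch file with a made-up "units" attribute.
path = os.path.join(tempfile.mkdtemp(), "test.h5")
with h5py.File(path, "w") as f:
    f["data"] = [1, 2, 3]
    f["data"].attrs["units"] = "m"

values, attrs = read_dataset(path, "data")
print(values.shape, values[0], sorted(attrs))
```

Keep in mind that dataset[...] reads the whole dataset into memory, so for very large datasets you may prefer to slice out only the part you need while the file is still open.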

Upvotes: 2
