Ason

Reputation: 509

Deleting information from an HDF5 file

I realize that an SO user asked this question before, but that was in 2009, and I was hoping that more knowledge of HDF5 is available now or that newer versions have fixed this particular issue. To restate the question here in terms of my own problem:

I have a gigantic file of nodes and elements from a large geometry and have already retrieved all the useful information I need from it. Therefore, in Python, I am trying to keep the original file, but delete the information I do not need and fill in more information from other sources. For example, I have a dataset of nodes that I don't need anymore. However, I need to keep the neighboring dataset and include information about their indices from an outside file. Is there any way to delete these specific datasets?

Or does the old idea of keeping "placeholders" in the HDF5 file still hold true, such that no one knows how to remove the information, or bothers to? I'm not too worried about the empty space, as long as it is faster to simply remove and add information than to create an entirely new file.

Note: I'm using h5py's 'r+' mode to read and write.
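
For reference, this is roughly what I'm doing (a minimal sketch; the file and dataset names are made up):

    import h5py

    # Open the existing file in place for reading and writing.
    with h5py.File("geometry.h5", "r+") as f:
        nodes = f["nodes"][...]        # dataset I no longer need
        elements = f["elements"][...]  # dataset I want to keep and extend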

Upvotes: 17

Views: 9923

Answers (3)

Shubhjot

Reputation: 93

In HDF5 1.10 and above, there is a file space management mechanism. It can be enabled by specifying an fcpl (file creation property list) in H5F.create.
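
If you are using h5py, recent versions (2.9 and later, if I remember right) expose this directly on the high-level File constructor, so you don't have to build the fcpl by hand. A minimal sketch, where the filename and dataset are just examples:

    import h5py
    import numpy as np

    # Create the file with the free-space manager (FSM) strategy and
    # persist free-space tracking across open/close cycles. These
    # keywords map onto H5Pset_file_space_strategy on the fcpl.
    f = h5py.File("data.h5", "w", fs_strategy="fsm", fs_persist=True)
    f.create_dataset("tmp", data=np.arange(1000))
    del f["tmp"]   # the freed space can now be reused by later writes
    f.close()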

One important change you will notice is that the file will be a little bigger (by a few KB) after your first import. But after that, the file size will eventually be smaller (once space is reclaimed).

You can monitor the free space in your HDF5 files with the h5stat tool:

h5stat -S filename

Upvotes: 0

Dana Robinson

Reputation: 4364

If you know that a particular dataset will be removed at the end of an analysis process, why keep it in the master file at all? I would store the temporary data in a separate HDF5 file which could be discarded after the analysis was complete. If it's important to link the temporary dataset inside the master file, just create an external link between the master and the temp using H5Lcreate_external(). External links consume a trivial amount of space.
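
In h5py, the equivalent of H5Lcreate_external() is h5py.ExternalLink; a rough sketch (file and dataset names are made up):

    import h5py
    import numpy as np

    # Keep the bulky temporary data in its own file...
    with h5py.File("temp.h5", "w") as tmp:
        tmp.create_dataset("scratch_nodes", data=np.arange(10))

    # ...and expose it in the master file through an external link.
    with h5py.File("master.h5", "a") as master:
        master["scratch_nodes"] = h5py.ExternalLink("temp.h5", "/scratch_nodes")

    # When the analysis is done, just delete temp.h5; the link
    # left behind in master.h5 costs almost nothing.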

Upvotes: 1

Ümit

Reputation: 17489

Removing entire nodes (groups or datasets) from an HDF5 file should be no problem.
However, if you want to reclaim the space, you have to run the h5repack tool.

From the HDF5 docs:

5.5.2. Deleting a Dataset from a File and Reclaiming Space

HDF5 does not at this time provide an easy mechanism to remove a dataset from a file or to reclaim the storage space occupied by a deleted object.

Removing a dataset and reclaiming the space it used can be done with the H5Ldelete function and the h5repack utility program. With the H5Ldelete function, links to a dataset can be removed from the file structure. After all the links have been removed, the dataset becomes inaccessible to any application and is effectively removed from the file. The way to recover the space occupied by an unlinked dataset is to write all of the objects of the file into a new file. Any unlinked object is inaccessible to the application and will not be included in the new file. Writing objects to a new file can be done with a custom program or with the h5repack utility program.
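
In h5py the unlinking step is just del, and the repack is a separate command-line pass; a rough sketch (file and dataset names are illustrative):

    import subprocess
    import h5py

    # Unlink the dataset: it becomes inaccessible, but the file
    # does not shrink (the space is only marked as free).
    with h5py.File("geometry.h5", "r+") as f:
        del f["nodes"]

    # Rewrite the remaining objects into a new, compacted file.
    subprocess.run(["h5repack", "geometry.h5", "geometry_repacked.h5"],
                   check=True)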

Alternatively, you can also have a look at PyTables' ptrepack tool. PyTables should be able to read HDF5 files written with h5py, and ptrepack works much like h5repack.

If you want to remove records from a dataset, then you probably have to retrieve the records you want to keep, create a new dataset, and remove the old one.
PyTables supports removing rows, though it's not recommended.
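
A rough h5py sketch of that filter-and-recreate pattern (the dataset name and the keep-condition are made up):

    import h5py

    with h5py.File("geometry.h5", "r+") as f:
        data = f["elements"][...]      # read the records into memory
        keep = data[data > 0]          # illustrative keep-condition
        del f["elements"]              # unlink the old dataset
        f.create_dataset("elements", data=keep)

    # Run h5repack afterwards if the file actually needs to shrink.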

Upvotes: 15
