Om Nom

Reputation: 179

Appending rows with pandas' to_hdf multiplies H5 file size?

I have an HDF5 file with about 13,000 rows × 5 columns. These rows were appended over time to the same file with DF.to_hdf(Filename, 'df', append=True, format='table'), and here's the size:

-rw-r--r--  1 omnom  omnom   807M Mar 10 15:55 Final_all_result.h5
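For context, here is a minimal, self-contained sketch of the append pattern described above; the file name matches the listing, but the DataFrame contents are made up for illustration:

import pandas as pd

filename = 'Final_all_result.h5'

# One batch of new rows (column names taken from the question).
df = pd.DataFrame({
    'Code': ['A1'],
    'ID': ['id-0001'],
    'Category': ['news'],
    'Title': ['Some title'],
    'Content': ['Some content'],
})

# Each call appends this batch to the existing 'df' table in the file.
df.to_hdf(filename, 'df', append=True, format='table')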

Recently I received a ValueError because the data I was trying to append to one of the columns was longer than the declared column size (2000, set with min_itemsize).

So I loaded all rows into memory and dumped them into a new HDF5 file in one go with:

DF.to_hdf(newFilename,
          'df',
          mode='a',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          format='table',
          min_itemsize={'index': 24,
                        'Code': 8,
                        'ID': 32,
                        'Category': 24,
                        'Title': 192,
                        'Content': 5000})

I was really surprised that the new file is about 1/10 the size of the original:

-rw-r--r--  1 omnom  omnom    70M Mar 10 16:01 Final_all_result_5000.h5

I double-checked the number of rows in both files; they're equal.
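For reference, a quick sketch of one way to do that comparison without loading the full data, assuming both files store the table under the key 'df' in table format:

import pandas as pd

for path in ['Final_all_result.h5', 'Final_all_result_5000.h5']:
    with pd.HDFStore(path, mode='r') as store:
        # nrows is read from the table's metadata, so the rows
        # themselves don't need to be loaded into memory.
        print(path, store.get_storer('df').nrows)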

Am I appending new rows the wrong way, causing the file size to multiply with every append operation? I googled and searched here, but I don't think this has been discussed before; maybe I searched with the wrong keywords.

Any help is appreciated.

UPDATE: I tried adding min_itemsize for all data columns in the append call, per the suggestion in this thread: pandas pytables append: performance and increase in file size:

DF.to_hdf(h5AbsPath,
          'df',
          mode='a',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          min_itemsize={'index': 24,
                        'Code': 8,
                        'ID': 32,
                        'Category': 24,
                        'Title': 192,
                        'Content': 5000},
          append=True)

but it still doesn't reduce the file size.

Thanks for the suggestions to add compression; per requirement, neither the appended file nor the newly dumped file is compressed.

Upvotes: 1

Views: 1712

Answers (1)

Fabio Lamanna

Reputation: 21542

I used to save .h5 files from pandas DataFrames. Try adding complib='blosc' and complevel=9 to the to_hdf() call. This should reduce the file size.
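A sketch of what that call could look like; the file names and the key are placeholders, not taken from the question:

import pandas as pd

df = pd.read_hdf('Final_all_result.h5', 'df')  # load the existing data

# complevel=9 is the maximum compression level; complib selects the
# compression library PyTables uses for the table's chunks.
df.to_hdf('Final_all_result_compressed.h5', 'df',
          format='table', complib='blosc', complevel=9)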

Upvotes: 1
