Reputation: 179
I have an HDF5 file with about 13,000 rows × 5 columns. The rows were appended over time to the same file with DF.to_hdf(Filename, 'df', append=True, format='table'), and this is its size:
-rw-r--r-- 1 omnom omnom 807M Mar 10 15:55 Final_all_result.h5
Recently I received a ValueError because the data I was trying to append to one of the columns was longer than that column's declared size (2000, set with min_itemsize).
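For reference, each run appended a small chunk of new rows roughly like this (chunk_df and its values are made up for illustration; the filename, key and column names follow the question):

import pandas as pd

# hypothetical chunk of new rows gathered since the last run
chunk_df = pd.DataFrame({'Code': ['A1'],
                         'ID': ['id-0001'],
                         'Category': ['news'],
                         'Title': ['a short title'],
                         'Content': ['x' * 100]})

# same call as above: the first write (or an explicit min_itemsize) fixes each
# string column's width, and a later chunk whose 'Content' exceeds the declared
# size raises the ValueError
chunk_df.to_hdf('Final_all_result.h5', 'df', append=True, format='table')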
So I loaded all rows into memory and dumped them into a new HDF5 file in one go with:
DF.to_hdf(newFilename, 'df',
          mode='a',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          format='table',
          min_itemsize={'index': 24,
                        'Code': 8,
                        'ID': 32,
                        'Category': 24,
                        'Title': 192,
                        'Content': 5000})
I was really surprised that the new file size is about 1/10 of the original file:
-rw-r--r-- 1 omnom omnom 70M Mar 10 16:01 Final_all_result_5000.h5
I double-checked the number of rows in both files; they're equal.
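Beyond the row count, the two files can be compared by looking at the declared column sizes of each stored table (a sketch; it assumes the filenames and the 'df' key from above):

import pandas as pd

for path in ['Final_all_result.h5', 'Final_all_result_5000.h5']:
    with pd.HDFStore(path, mode='r') as store:
        storer = store.get_storer('df')
        print(path, 'rows:', storer.nrows)
        # the PyTables description lists each column's dtype and itemsize
        print(storer.table.description)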
Am I appending new rows the wrong way, causing the file size to multiply with every append operation? I googled and searched here but don't think this has been discussed before, or maybe I searched with the wrong keywords.
Any help is appreciated.
UPDATE:
I tried adding min_itemsize for all data columns to the append call, per the suggestion in this thread: pandas pytables append: performance and increase in file size:
DF.to_hdf(h5AbsPath, 'df',
          mode='a',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          min_itemsize={'index': 24,
                        'Code': 8,
                        'ID': 32,
                        'Category': 24,
                        'Title': 192,
                        'Content': 5000},
          append=True)
but it still doesn't reduce the file size.
Thanks for the suggestions to add compression, but per requirement neither the appended file nor the newly dumped file is compressed.
Upvotes: 1
Views: 1712
Reputation: 21542
I used to save .h5 files from pandas DataFrames. Try adding complib='blosc' and complevel=9 to the to_hdf() call. This should reduce the file size.
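For example (a sketch reusing the question's variable names; only complib and complevel are added):

DF.to_hdf(newFilename, 'df',
          mode='a',
          format='table',
          data_columns=['Code', 'ID', 'Category', 'Title', 'Content'],
          complib='blosc',   # compression library
          complevel=9)       # maximum compression level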
Upvotes: 1