Reputation: 57
When I create a pandas DataFrame with a column that contains a string and save it as HDF5, the file size seems extremely large. The following code produces a file with a size of 1’063’224 Bytes.
from pathlib import Path
import pandas as pd
data_frame = pd.DataFrame({'foo': ['bar']})
data_frame.to_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz', mode='w')
When I replace the 'bar' with a 1, the file size shrinks down to 7’032 Bytes, which seems (more) reasonable.
Does anyone know where that megabyte of data comes from?
Upvotes: 0
Views: 655
Reputation: 143
The problem is that the 'foo' column has dtype object: strings have variable length, so the column cannot be written as a plain fixed-size HDF5 array.
The column can be converted to a fixed-length byte string by giving the dtype a length:
data_frame['foo'] = data_frame['foo'].astype('|S80')  # maximum length set to 80 bytes
With this conversion, the file is 7024 Bytes, which is even smaller than the integer example.
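To check this yourself, here is a minimal sketch that converts the column and inspects the resulting dtype. The helper names to_fixed_width and hdf5_size are made up for illustration and are not part of pandas. hdf5_size needs the optional PyTables package, so it is only defined here and not called.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def to_fixed_width(column: pd.Series, width: int = 80) -> pd.Series:
    """Hypothetical helper: convert a string column to fixed-width bytes."""
    # '|S80' is the NumPy dtype for byte strings of exactly 80 bytes.
    return column.astype(f'|S{width}')


def hdf5_size(frame: pd.DataFrame) -> int:
    """Hypothetical helper: write frame to a temporary HDF5 file, return its size.

    Requires the optional PyTables dependency ('tables').
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'file.hdf5')
        frame.to_hdf(path, key='baz', mode='w')
        return os.path.getsize(path)


data_frame = pd.DataFrame({'foo': ['bar']})
print(data_frame['foo'].dtype)  # dtype before the conversion

data_frame['foo'] = to_fixed_width(data_frame['foo'])
print(data_frame['foo'].dtype)  # dtype after the conversion
```

Calling hdf5_size(data_frame) before and after the conversion should let you compare the two file sizes on your own setup. Keep in mind that a fixed width is a hard limit: as far as I know, values longer than 80 bytes get cut off by the cast, and non-ASCII text fails to convert, so pick the width with your real data in mind.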
Upvotes: 2