reyymo

Reputation: 57

Pandas DataFrame saved as an HDF5 file is extremely large when it contains a column of string values

When I create a pandas DataFrame with a column that contains a string and save it as HDF5, the resulting file is extremely large. The following code produces a file of 1’063’224 bytes.

from pathlib import Path
import pandas as pd

data_frame = pd.DataFrame({'foo': ['bar']})
data_frame.to_hdf(Path.home() / 'Desktop' / 'file.hdf5', key='baz', mode='w')

When I replace 'bar' with 1, the file size shrinks to 7’032 bytes, which seems (more) reasonable.

Does anyone know where that megabyte of data comes from?

Upvotes: 0

Views: 655

Answers (1)

Radarrudi

Reputation: 143

The problem is that the column has dtype object, because strings have variable length, which cannot be stored as a fixed-width HDF5 type.

The column can be converted to a fixed-length byte string by giving the dtype a length:

data_frame['foo'] = data_frame['foo'].astype('|S80')  # fixed width: each value takes at most 80 bytes

With this conversion the file is 7024 bytes, which is even smaller than the integer example?!
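To illustrate the difference between the two dtypes, here is a minimal sketch at the NumPy level (an assumption for illustration: it uses plain NumPy arrays rather than the DataFrame from the question, and the sample values are made up). It only shows what the '|S80' conversion does to the element layout; it does not write any HDF5 file.

```python
import numpy as np

# An object array holds references to Python str objects,
# so its elements have no fixed width for the string data itself.
as_object = np.array(['bar', 'a longer string'], dtype=object)

# '|S80' is a fixed-width byte-string dtype:
# every element occupies exactly 80 bytes, padded with null bytes.
as_fixed = as_object.astype('|S80')

print(as_object.dtype)                           # object
print(as_fixed.dtype, as_fixed.dtype.itemsize)   # |S80 80
print(as_fixed[0])                               # b'bar'
```

Note that the values become bytes rather than str, so it is worth checking that the longest value fits into the chosen width and that the data is ASCII-compatible before converting.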

Upvotes: 2
