user7867665

Reputation: 882

Dataframe size increases after saving to .h5 the first time

A pandas DataFrame's size increases significantly after saving it as .h5 for the first time. If I save the loaded DataFrame again, the size doesn't increase any further. This makes me suspect that some kind of metadata is being saved on the first save. What is the reason for this increase?

Is there an easy way to avoid it?

I could compress the file, but I am making my comparisons without compression. Would the problem scale differently with compression?
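For reference, a compressed save would look something like the sketch below; complevel and complib are standard pandas to_hdf parameters, and the file name is just an example:

# Sketch: same save, but with blosc compression enabled.
# complevel ranges 0-9; complib can be 'zlib', 'blosc', 'bzip2', or 'lzo'.
dataset.to_hdf('data_compressed.h5', key='df', mode='w', complevel=9, complib='blosc')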

Example code is below. The DataFrame's reported memory usage increases from 15.3 MB to 22.9 MB after the first save/load cycle.

import numpy as np
import pandas as pd

# Two float64 columns, one million rows each.
x = np.random.normal(0, 1, 1000000)
y = x * 2
dataset = pd.DataFrame({'Column1': x, 'Column2': y})
# info() prints its report and returns None, hence the trailing None lines below.
print(dataset.info(memory_usage='deep'))

# First save/load cycle: memory usage grows.
dataset.to_hdf('data.h5', key='df', mode='w')
dataset2 = pd.read_hdf('data.h5')
print(dataset2.info(memory_usage='deep'))

# Second save/load cycle: the size does not grow again.
dataset2.to_hdf('data2.h5', key='df', mode='w')
dataset3 = pd.read_hdf('data2.h5')
print(dataset3.info(memory_usage='deep'))

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1    1000000 non-null float64
Column2    1000000 non-null float64
dtypes: float64(2)
memory usage: 15.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1    1000000 non-null float64
Column2    1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1    1000000 non-null float64
Column2    1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None

This is happening because the RangeIndex is converted to an Int64Index on save. Is there a way to optimise this? It looks like there's no way to drop the index:

https://github.com/pandas-dev/pandas/issues/8319
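One partial workaround: since the index carries no information here, a fresh RangeIndex can be restored after loading. This should bring the in-memory footprint back down, although the file on disk still stores the index values:

# Sketch: replace the materialised Int64Index with a lazy RangeIndex after loading.
# reset_index(drop=True) discards the stored index, so memory usage
# should drop back to roughly 15.3 MB.
dataset2 = pd.read_hdf('data.h5')
dataset2 = dataset2.reset_index(drop=True)
print(dataset2.info(memory_usage='deep'))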

Upvotes: 0

Views: 112

Answers (1)

user7867665

Reputation: 882

The best solution I've found so far is to save as a pickle:

dataset.to_pickle("datapkl.pkl")
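Loading it back is a one-liner, and pickle preserves the RangeIndex exactly, so the memory usage stays at 15.3 MB:

# Pickle round-trips the DataFrame as-is, including its RangeIndex.
dataset = pd.read_pickle("datapkl.pkl")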

A less convenient option is to convert to NumPy and save with h5py, but then loading and converting back to pandas takes a lot of time:

import h5py

# Convert to a plain NumPy array and store it as a raw HDF5 dataset.
a = dataset.to_numpy()
h5f = h5py.File('datah5.h5', 'w')
h5f.create_dataset('dataset_1', data=a)
h5f.close()
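For completeness, reading back and rebuilding the DataFrame would look roughly like the sketch below; the column names have to be supplied manually, since h5py stores only the raw array:

# Sketch: load the raw array and rebuild the DataFrame by hand.
with h5py.File('datah5.h5', 'r') as h5f:
    a = h5f['dataset_1'][:]
df = pd.DataFrame(a, columns=['Column1', 'Column2'])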

Upvotes: 1
