user7867665

Reputation: 882

Dataframe size increases after saving to .h5 the first time

A pandas DataFrame's size increases significantly after saving it as .h5 for the first time. If I save the loaded DataFrame again, the size doesn't increase any further. This makes me suspect that some kind of metadata is being saved on the first save. What is the reason for this increase?

Is there an easy way to avoid it?

I could compress the file, but I am making my comparisons without compression. Would the problem scale differently with compression?
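For reference, a compressed save would look something like the sketch below; complevel and complib are standard pandas to_hdf parameters, and the file name is just an example:

# Sketch: same save, but with blosc compression enabled.
# complevel ranges 0-9; complib can be 'zlib', 'blosc', 'bzip2', or 'lzo'.
dataset.to_hdf('data_compressed.h5', key='df', mode='w', complevel=9, complib='blosc')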

Example code is below. The DataFrame's reported memory usage increases from 15.3 MB to 22.9 MB after the first save/load cycle.

import numpy as np
import pandas as pd

# Two float64 columns, one million rows each.
x = np.random.normal(0, 1, 1000000)
y = x * 2
dataset = pd.DataFrame({'Column1': x, 'Column2': y})
# info() prints its report and returns None, hence the trailing None lines below.
print(dataset.info(memory_usage='deep'))

# First save/load cycle: memory usage grows.
dataset.to_hdf('data.h5', key='df', mode='w')
dataset2 = pd.read_hdf('data.h5')
print(dataset2.info(memory_usage='deep'))

# Second save/load cycle: the size does not grow again.
dataset2.to_hdf('data2.h5', key='df', mode='w')
dataset3 = pd.read_hdf('data2.h5')
print(dataset3.info(memory_usage='deep'))

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1    1000000 non-null float64
Column2    1000000 non-null float64
dtypes: float64(2)
memory usage: 15.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1    1000000 non-null float64
Column2    1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1    1000000 non-null float64
Column2    1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None

This is happening because the RangeIndex is converted to an Int64Index on save. Is there a way to optimise this? It looks like there's no way to drop the index:

https://github.com/pandas-dev/pandas/issues/8319
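One partial workaround: since the index carries no information here, a fresh RangeIndex can be restored after loading. This should bring the in-memory footprint back down, although the file on disk still stores the index values:

# Sketch: replace the materialised Int64Index with a lazy RangeIndex after loading.
# reset_index(drop=True) discards the stored index, so memory usage
# should drop back to roughly 15.3 MB.
dataset2 = pd.read_hdf('data.h5')
dataset2 = dataset2.reset_index(drop=True)
print(dataset2.info(memory_usage='deep'))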

Upvotes: 0

Views: 112

Answers (1)

user7867665

Reputation: 882

The best solution I've found so far is to save as a pickle:

dataset.to_pickle("datapkl.pkl")
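Loading it back is a one-liner, and pickle preserves the RangeIndex exactly, so the memory usage stays at 15.3 MB:

# Pickle round-trips the DataFrame as-is, including its RangeIndex.
dataset = pd.read_pickle("datapkl.pkl")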

A less convenient option is to convert to NumPy and save with h5py, but then loading and converting back to pandas takes a lot of time:

import h5py

# Convert to a plain NumPy array and store it as a raw HDF5 dataset.
a = dataset.to_numpy()
h5f = h5py.File('datah5.h5', 'w')
h5f.create_dataset('dataset_1', data=a)
h5f.close()
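For completeness, reading back and rebuilding the DataFrame would look roughly like the sketch below; the column names have to be supplied manually, since h5py stores only the raw array:

# Sketch: load the raw array and rebuild the DataFrame by hand.
with h5py.File('datah5.h5', 'r') as h5f:
    a = h5f['dataset_1'][:]
df = pd.DataFrame(a, columns=['Column1', 'Column2'])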

Upvotes: 1
