Reputation: 882
A pandas dataframe's size increases significantly after saving it as .h5 the first time. If I save the loaded dataframe again, the size doesn't increase any further. That makes me suspect that some kind of metadata is being saved the first time. What is the reason for this increase?
Is there an easy way to avoid it?
I can compress the file, but I am making comparisons without compression. Would the problem scale differently with compression?
Example code below. The size reported by info() increases from 15.3 MB to 22.9 MB:
import numpy as np
import pandas as pd

x = np.random.normal(0, 1, 1000000)
y = x * 2
dataset = pd.DataFrame({'Column1': x, 'Column2': y})
# info() prints its report and returns None, hence the "None" lines in the output
print(dataset.info(memory_usage='deep'))
# First round trip: write to HDF5 and read back
dataset.to_hdf('data.h5', key='df', mode='w')
dataset2 = pd.read_hdf('data.h5')
print(dataset2.info(memory_usage='deep'))
# Second round trip: write the loaded frame and read it back again
dataset2.to_hdf('data2.h5', key='df', mode='w')
dataset3 = pd.read_hdf('data2.h5')
print(dataset3.info(memory_usage='deep'))
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 15.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
It is happening because the RangeIndex is converted to an Int64Index on save. Is there a way to optimise this? It looks like there's no way to drop the index:
https://github.com/pandas-dev/pandas/issues/8319
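To check that the index accounts for the whole difference, here is a rough sketch that avoids HDF5 altogether. It simulates the loaded frame by assigning an explicit 64-bit integer index (that simulation is my assumption about what the round trip produces, based on the output above). One million 8-byte integers come to about 7.6 MB, which matches 22.9 - 15.3. It also shows that reset_index(drop=True) puts a RangeIndex back after loading.

```python
import numpy as np
import pandas as pd

n = 1_000_000
x = np.random.normal(0, 1, n)
df = pd.DataFrame({'Column1': x, 'Column2': x * 2})  # default RangeIndex

# Assumption: stand-in for the frame returned by read_hdf, i.e. the same
# data with a materialized 64-bit integer index instead of a RangeIndex.
materialized = df.copy()
materialized.index = np.arange(n, dtype=np.int64)

range_bytes = df.memory_usage(index=True, deep=True).sum()
int_bytes = materialized.memory_usage(index=True, deep=True).sum()
extra_bytes = int_bytes - range_bytes  # close to n * 8 bytes

# Dropping the integer index yields a fresh RangeIndex again.
restored = materialized.reset_index(drop=True)
restored_bytes = restored.memory_usage(index=True, deep=True).sum()

print(f"RangeIndex frame:     {range_bytes / 2**20:.1f} MB")
print(f"integer-index frame:  {int_bytes / 2**20:.1f} MB")
print(f"after reset_index:    {restored_bytes / 2**20:.1f} MB")
```

So the extra size is in-memory index storage, and calling reset_index(drop=True) on the loaded frame should bring it back down, at the cost of one extra step after every read.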
Upvotes: 0
Views: 112
Reputation: 882
The best solution I have found so far is to save as pickle:
dataset.to_pickle("datapkl.pkl")
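A small sketch of why pickle helps here: the RangeIndex survives the round trip, so the loaded frame is no bigger than the original. The temporary directory and the smaller row count are only there to keep the example quick and self-contained.

```python
import os
import tempfile

import numpy as np
import pandas as pd

x = np.random.normal(0, 1, 1000)
dataset = pd.DataFrame({'Column1': x, 'Column2': x * 2})

# Write to a throwaway directory and read the frame back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'datapkl.pkl')
    dataset.to_pickle(path)
    loaded = pd.read_pickle(path)

# The index type is preserved, unlike in the HDF5 round trip above.
index_kept = isinstance(loaded.index, pd.RangeIndex)
print(type(loaded.index).__name__, index_kept)
```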
A less convenient option is to convert to numpy and save with h5py, but then loading and converting back to pandas takes a lot of time:
import h5py

a = dataset.to_numpy()
# The context manager closes the file so the data is flushed to disk
with h5py.File('datah5.h5', 'w') as h5f:
    h5f.create_dataset('dataset_1', data=a)
Upvotes: 1