jeffalstott

Reputation: 2693

Why are CSV files smaller than HDF5 files when writing with Pandas?

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
11M test.csv  16M test.h5

If I use an even larger dataset then the effect is even bigger. Using an HDFStore like below changes nothing.

store = pd.HDFStore('test.h5', table=True)
store['df'] = pd.DataFrame(data=np.zeros((1000000,1)))
store.close()

Edit: Never mind. The example is bad! Using some non-trivial numbers instead of zeros changes the story.

from numpy.random import rand
import pandas as pd

df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
260M test.csv  153M test.h5

Expressing numbers as binary floats should take fewer bytes than expressing them as strings of characters with one character per digit. This is generally true, except in my first example, in which every number was '0.0'. There, only three characters were needed to represent each value, so the string representation was smaller than the 8-byte float representation.
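The size trade-off above can be checked directly in Python: a float64 costs a fixed 8 bytes in binary, while its text form is cheap for trivial values and expensive for full-precision ones. A minimal sketch:

```python
import numpy as np

# A float64 occupies 8 bytes in binary regardless of its value,
# while its text form grows with the number of digits needed.
binary_size = np.float64(0.0).nbytes   # 8 bytes, same for every value
short_text = len("0.0")                # 3 characters: trivial values are cheap as text
long_text = len(repr(1 / 3))           # 18 characters: full-precision floats are not

print(binary_size, short_text, long_text)  # 8 3 18
```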

Upvotes: 5

Views: 3143

Answers (2)

chw21

Reputation: 8140

For .csv, your method stores characters like this:

999999,0.0<CR>

That's up to 11 characters per value. At 1 million values, this comes to close to 11MB.

HDF5 (in pandas' fixed format) stores each row as fixed-width binary: an 8-byte int64 index plus an 8-byte float64 value, never mind that it's the same value over and over. That is 16 bytes × 1,000,000 rows, which is roughly 16 MB.

Store not 0.0 but some random data, and the .csv quickly blows up to 25 MB and more, while the HDF5 file stays the same size. And while the csv file loses accuracy, the HDF5 file retains it.

Upvotes: 2

Dirk is no longer here

Reputation: 368181

Briefly:

  • csv files are 'dumb': it is one character at a time, so if you print the (say, four-byte) float 1.0 to ten digits you really use that many bytes -- but the good news is that csv compresses well, so consider .csv.gz.

  • hdf5 is a meta-format and the No Free Lunch theorem still holds: the entries and values need to be stored somewhere. Which may make hdf5 larger.
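The .csv.gz suggestion above is a one-line change in pandas: the codec is picked from the file suffix (or passed explicitly via `compression`), and on the highly repetitive all-zeros frame from the question the compressed text ends up far smaller than either raw file. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((1000000, 1)))

# Plain text, ~11 MB for this frame.
df.to_csv('test.csv')

# Same text run through gzip; pandas infers the codec from the .gz
# suffix, or you can pass compression='gzip' explicitly.
df.to_csv('test.csv.gz', compression='gzip')
```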

But you are overlooking a larger issue: csv is just text. Which has limited precision -- whereas hdf5 is one of several binary (serialization) formats which store data to higher precision. It really is apples to oranges in that regard too.
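To see the precision point concretely: text round-trips a float64 only if you spend enough characters on it. A sketch using `float_format` to cap the printed digits (writing to an in-memory buffer rather than a file) shows the truncation:

```python
import io
import pandas as pd

df = pd.DataFrame(data=[0.1234567890123456])

# Capping the text output at 6 digits throws away the rest of the
# mantissa; a binary format like HDF5 stores all 8 bytes and
# round-trips the value exactly.
buf = io.StringIO()
df.to_csv(buf, float_format='%.6f', index=False)
buf.seek(0)
truncated = pd.read_csv(buf).iloc[0, 0]

print(truncated)  # 0.123457, not 0.1234567890123456
```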

Upvotes: 5
