jeffalstott

Reputation: 2693

Why are CSV files smaller than HDF5 files when writing with Pandas?

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
11M test.csv  16M test.h5

If I use an even larger dataset then the effect is even bigger. Using an HDFStore like below changes nothing.

store = pd.HDFStore('test.h5', table=True)
store['df'] = pd.DataFrame(data=np.zeros((1000000,1)))
store.close()

Edit: Never mind. The example is bad! Using some non-trivial numbers instead of zeros changes the story.

from numpy.random import rand
import pandas as pd

df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
260M test.csv  153M test.h5

Expressing numbers as binary floats should take fewer bytes than expressing them as strings of characters with one character per digit. This is generally true, except in my first example, in which every number was '0.0'. There, only three characters were needed to represent each value, so the string representation was smaller than the 8-byte float representation.
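The size trade-off above can be checked directly in Python: a float64 costs a fixed 8 bytes in binary, while its text form is cheap for trivial values and expensive for full-precision ones. A minimal sketch:

```python
import numpy as np

# A float64 occupies 8 bytes in binary regardless of its value,
# while its text form grows with the number of digits needed.
binary_size = np.float64(0.0).nbytes   # 8 bytes, same for every value
short_text = len("0.0")                # 3 characters: trivial values are cheap as text
long_text = len(repr(1 / 3))           # 18 characters: full-precision floats are not

print(binary_size, short_text, long_text)  # 8 3 18
```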

Upvotes: 5

Views: 3143

Answers (2)

chw21

Reputation: 8140

For .csv, your method stores characters like this:

999999,0.0<CR>

That's up to 11 characters per value. At 1 million values, this comes to close to 11MB.

HDF5 (in pandas' fixed format) stores each row as fixed-width binary: an 8-byte int64 index plus an 8-byte float64 value, never mind that it's the same value over and over. That is 16 bytes × 1,000,000 rows, which is roughly 16 MB.

Store not 0.0 but some random data, and the .csv quickly blows up to 25 MB and more, while the HDF5 file stays the same size. And while the csv file loses accuracy, the HDF5 file retains it.

Upvotes: 2

Dirk is no longer here

Reputation: 368181

Briefly:

  • csv files are 'dumb': it is one character at a time, so if you print the (say, four-byte) float 1.0 to ten digits you really use that many bytes -- but the good news is that csv compresses well, so consider .csv.gz.

  • hdf5 is a meta-format and the No Free Lunch theorem still holds: the entries and values need to be stored somewhere. Which may make hdf5 larger.
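The .csv.gz suggestion above is a one-line change in pandas: the codec is picked from the file suffix (or passed explicitly via `compression`), and on the highly repetitive all-zeros frame from the question the compressed text ends up far smaller than either raw file. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((1000000, 1)))

# Plain text, ~11 MB for this frame.
df.to_csv('test.csv')

# Same text run through gzip; pandas infers the codec from the .gz
# suffix, or you can pass compression='gzip' explicitly.
df.to_csv('test.csv.gz', compression='gzip')
```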

But you are overlooking a larger issue: csv is just text. Which has limited precision -- whereas hdf5 is one of several binary (serialization) formats which store data to higher precision. It really is apples to oranges in that regard too.
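To see the precision point concretely: text round-trips a float64 only if you spend enough characters on it. A sketch using `float_format` to cap the printed digits (writing to an in-memory buffer rather than a file) shows the truncation:

```python
import io
import pandas as pd

df = pd.DataFrame(data=[0.1234567890123456])

# Capping the text output at 6 digits throws away the rest of the
# mantissa; a binary format like HDF5 stores all 8 bytes and
# round-trips the value exactly.
buf = io.StringIO()
df.to_csv(buf, float_format='%.6f', index=False)
buf.seek(0)
truncated = pd.read_csv(buf).iloc[0, 0]

print(truncated)  # 0.123457, not 0.1234567890123456
```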

Upvotes: 5
