matousc

Reputation: 3977

Why is the HDF5 file so huge when I store many empty pandas Series in it?

If I create an HDF5 file with pandas using the following code:

import pandas as pd

store = pd.HDFStore("store.h5")

for x in range(1000):
    store["name"+str(x)] = pd.Series()

All the series are empty, so why does "store.h5" take up 1.1 GB on the hard drive?

Upvotes: 6

Views: 214

Answers (1)

Andreus

Reputation: 2487

Short version: You have found a bug. Quoting this bug on GitHub:

...required a bit of a hackjob (pytables doesn't like zero-length objects)

I can reproduce this error on my machine. Simply changing your code to this:

import pandas as pd
store = pd.HDFStore("store.h5")
for x in range(1000):
    store["name"+str(x)] = pd.Series([1,2])

results in a sane megabyte-scale file. I cannot find an open bug for this on GitHub; you might try reporting it.

I assume you've already dealt with the issue in your code, but if you haven't, you should probably check that no array dimension is zero before storing an object:

import numpy as np
import pandas as pd
toStore = pd.Series()  # empty, so the assert below fires
assert np.prod(toStore.shape) != 0, 'Tried to store an empty object!'
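
If you would rather skip empty objects than abort on them, the same check can be wrapped in a small helper. This is only a sketch: `put_if_nonempty` is a hypothetical name, not part of pandas, and it assumes the store accepts `store[key] = obj` assignment (as `pd.HDFStore` does in the code above, and as a plain dict does).

```python
import pandas as pd


def put_if_nonempty(store, key, obj):
    """Store obj under key unless it has zero elements.

    Hypothetical helper; returns True if the object was stored,
    False if it was skipped because it was empty.
    """
    # .size is the total number of elements for both Series and DataFrame,
    # so it is zero exactly when some dimension of .shape is zero.
    if obj.size == 0:
        return False
    store[key] = obj
    return True
```

In the loop from the question you would call `put_if_nonempty(store, "name" + str(x), series)` in place of the direct assignment, so empty series never reach the file.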

Upvotes: 2
