Reputation: 241

Pandas HDFStore for out-of-core Sequential read/write of sets with variable sizes

I want to read from and write to an HDF5 file incrementally because the data doesn't fit into memory.

The data consists of sets of integers. I only need to read and write the sets sequentially, with no random access: I read set1, then set2, then set3, and so on.

The problem is that I can't retrieve the sets by index.

import pandas as pd

# Write two Series of different lengths to the same key.
x = pd.HDFStore('test.hf', 'w')
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10, 2])
x.append('dframe', b, index=True)
x.close()

# Read everything back, then try to select the first "set".
x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
y = x.select('dframe', start=0, stop=1)
print("selected:", y)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected: 0    1
dtype: int64

It doesn't select my 0th set, which is {1, 10}, i.e. the rows whose index is 0.
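A note on why this happens, as far as I can tell: `start` and `stop` count rows by position in the stored table, while the printed index labels repeat because each appended Series keeps its own 0-based index. The sketch below reproduces the same distinction with plain pandas and no HDF5 file; `pd.concat`, `iloc`, and `loc` are stand-ins chosen for illustration, not what HDFStore runs internally.

```python
import pandas as pd

a = pd.Series([1])
b = pd.Series([10, 2])

# Stacking the two Series mimics the appended table: labels are 0, 0, 1.
stacked = pd.concat([a, b])

# Positional slice, analogous to select(start=0, stop=1): first row only.
by_position = stacked.iloc[0:1]

# Label lookup, analogous to select(where='index == 0'): every row labelled 0.
by_label = stacked.loc[0]

print(by_position.tolist())
print(by_label.tolist())
```

The positional slice returns only the value 1, while the label lookup returns both 1 and 10.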

Upvotes: 1

Views: 314

Answers (1)

Reputation: 241

This way works, but I don't know how fast it is.

Does it scan the whole file to find the rows with that index?

That would be quite a waste of time.

import pandas as pd

# HDFStore() itself does not accept format=; append() always writes the "table" format.
x = pd.HDFStore('test.hf', 'w', complevel=9)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10, 2])
x.append('dframe', b, index=True)
x.close()

x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
# Select by index label rather than by row position.
y = x.select('dframe', where='index == 0')
print('selected:')
for i in y:
    print(i)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected:
1
10
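On the speed question: as I understand the pandas IO documentation, `append` with the default `index=True` builds a PyTables index on the index column, so a `where='index == ...'` query should not need a full scan, though I have not benchmarked it. One way to make each set retrievable as a unit is to label every row of a set with that set's id. The following is a minimal sketch under that assumption; the sample data, the key name `sets`, and the temporary file path are hypothetical, and it needs PyTables installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: variable-size sets of integers.
sets = [{1, 10}, {2}, {3, 4, 5}]

path = os.path.join(tempfile.mkdtemp(), "sets.h5")

# Write the sets one at a time; only one set is in memory per iteration.
with pd.HDFStore(path, mode="w") as store:
    for set_id, members in enumerate(sets):
        values = sorted(members)
        # Every row of a set carries the same index label: the set id.
        s = pd.Series(values, index=[set_id] * len(values))
        store.append("sets", s)

# Read the sets back sequentially by their id.
recovered = []
with pd.HDFStore(path, mode="r") as store:
    for set_id in range(len(sets)):
        rows = store.select("sets", where=f"index == {set_id}")
        recovered.append(set(rows.tolist()))

print(recovered)
```

Here the index label means "which set" rather than "position within the appended Series", which is what makes the per-set query unambiguous.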

Upvotes: 1
