Mattijn

Reputation: 13930

Iterate over HDFStore using chunksize saving into new HDFStore

I got all my data into an HDFStore (yeah!), but now how do I get it out again?

I've saved 6 DataFrames as frame_table in my HDFStore. Each of these tables looks like the following, but the length varies (date is a Julian date).

>>> a = store.select('var1')
>>> a.head()
                      var1
x_coor y_coor date         
928    310    2006257   133
932    400    2006257   236
939    311    2006257   253
941    312    2006257   152
942    283    2006257    68

Then I select from all my tables the values where the date is, e.g., > 2006256.

>>> b = store.select_as_multiple(['var1','var2','var3','var4','var5','var6'], where=(pd.Term('date','>',date)), selector= 'var1')
>>> b.head()
                      var1   var2  var3  var4  var5  var6
x_coor y_coor date                                        
928    310    2006257   133  14987  7045    18   240   171
              2006273   136      0  7327    30   253   161
              2006289   125      0  -239    83   217   168
              2006305    95  14604  6786    13   215    57
              2006321    84      0  4548    13   133    88

This works, but only for relatively small .h5 files. For my normal .h5 files I would like to temporarily store the result in an HDFStore using chunksize (since I also have to add a new column to it based on this selection). I thought of doing it like this (using this):

for df in store.select_as_multiple(['var1','var2','var3','var4','var5','var6'], where=(pd.Term('date','>',date)), selector= 'var1', chunksize=15):
    tempstore.put('test',pd.DataFrame(df))

But then only one chunk ends up in the store. And with:

tempstore.append('test',pd.DataFrame(df))

I get ValueError: Can only append to Tables. What am I doing wrong?

Upvotes: 2

Views: 1750

Answers (1)

Jeff

Reputation: 129068

When you did this with put, each call overwrote the 'test' node with the latest chunk, so only the last chunk survived. The append that followed then raised the error because the node was a storer (a non-table), and you can't append to one.

That is:

  • put writes a single, non-appendable fixed format (called a storer). It is fast to write, but you can neither append to it nor query it (you can only retrieve it in its entirety).

  • append creates a table format, which is what you want here (and what a frame_table is).
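
As a rough illustration of the difference, here is a minimal sketch that writes the same two chunks with put and with append. It assumes pandas with PyTables installed; the file name, the key names fixed_key and table_key, and the sample values are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Two small "chunks" standing in for the pieces of a larger selection (made-up data).
chunks = [
    pd.DataFrame({"var1": [133, 236]}),
    pd.DataFrame({"var1": [253, 152]}),
]

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "put_vs_append.h5")

with pd.HDFStore(path, mode="w") as tempstore:
    for df in chunks:
        tempstore.put("fixed_key", df)     # fixed format: each call replaces the node
        tempstore.append("table_key", df)  # table format: each call adds rows

    rows_after_put = len(tempstore["fixed_key"])
    rows_after_append = len(tempstore["table_key"])

    # Appending to a node that was written with put fails.
    try:
        tempstore.append("fixed_key", chunks[0])
        append_error = None
    except ValueError as exc:
        append_error = str(exc)

print(rows_after_put, rows_after_append)
print(append_error)
```

Only the last chunk is left under fixed_key, while table_key holds the rows of both chunks.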

Note: you don't need to do pd.DataFrame(df) as df is already a frame.

So first remove the existing 'test' node from the store, if it's there:

if 'test' in tempstore:
    tempstore.remove('test')

Then append each DataFrame:

for df in store.select_as_multiple(.....):
    tempstore.append('test', df)
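
Putting the two steps together, an end-to-end sketch could look like the following. It assumes pandas with PyTables installed and uses only two small made-up tables (var1, var2) instead of six; the file names and values are hypothetical. The where clause is written as a string expression, which newer pandas versions accept in place of pd.Term.

```python
import os
import tempfile

import pandas as pd

# Made-up source data shaped like the tables in the question.
idx = pd.MultiIndex.from_product(
    [[928, 932], [310], [2006241, 2006257, 2006273]],
    names=["x_coor", "y_coor", "date"],
)
var1 = pd.DataFrame({"var1": [10, 20, 30, 40, 50, 60]}, index=idx)
var2 = pd.DataFrame({"var2": [1, 2, 3, 4, 5, 6]}, index=idx)

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "source.h5")  # hypothetical file names
tmp_path = os.path.join(workdir, "temp.h5")

# Write the source tables in table format so they can be queried.
with pd.HDFStore(src_path, mode="w") as store:
    store.append("var1", var1)
    store.append("var2", var2)

with pd.HDFStore(src_path, mode="r") as store, pd.HDFStore(tmp_path, mode="a") as tempstore:
    # Start from a clean node so earlier runs do not leave rows behind.
    if "test" in tempstore:
        tempstore.remove("test")

    # Each iteration yields one DataFrame of at most `chunksize` rows.
    for df in store.select_as_multiple(
        ["var1", "var2"], where="date > 2006256", selector="var1", chunksize=3
    ):
        tempstore.append("test", df)

    result = tempstore.select("test")

print(result)
```

Because every chunk is appended to a table, the rows accumulate in 'test' and the node can still be queried afterwards.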

Upvotes: 5
