piRSquared
piRSquared

Reputation: 294258

Can I update an HDFStore?

Consider the following hdfstore and dataframes df and df2

import pandas as pd

store = pd.HDFStore('test.h5')

midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)

df

     C
A B   
0 X  0
  Y  1
  Z  2
1 X  3
  Y  4
  Z  5

midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)

df2

     C
A B   
0 V  0
  W  1
  X  2
1 V  3
  W  4
  X  5

I want to first write df to the store.

store.append('df', df)

store.get('df')

     C
A B   
0 X  0
  Y  1
  Z  2
1 X  3
  Y  4
  Z  5

At a later point in time I will have another dataframe that I want to update the store with. I want to overwrite the rows with the same index values as are in my new dataframe while keeping the old ones.

When I do

store.append('df', df2)

store.get('df')

     C
A B   
0 X  0
  Y  1
  Z  2
1 X  3
  Y  4
  Z  5
0 V  0
  W  1
  X  2
1 V  3
  W  4
  X  5

This isn't at all what I want. Notice that (0, 'X') and (1, 'X') are repeated. I can manipulate the combined dataframe and overwrite, but I expect to be working with a lot data where this wouldn't be feasible.

How do I update the store to get?

     C
A B   
0 V  0
  W  1
  X  2
  Y  1
  Z  2
1 V  3
  W  4
  X  5
  Y  4
  Z  5

You'll see that For each level of 'A', 'Y' and 'Z' are the same, 'V' and 'W' are new, and 'X' is updated.

What is the correct way to do this?

Upvotes: 5

Views: 2280

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

Idea: remove matching rows (with matching index values) from the HDF first and then append df2 to HDFStore.

Problem: I couldn't find a way to use where="index in df2.index" for multi-index indexes.

Solution: first convert multiindexes to normal ones:

df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)

df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)

this yields:

In [348]: df
Out[348]:
     C
0_X  0
0_Y  1
0_Z  2
1_X  3
1_Y  4
1_Z  5

In [349]: df2
Out[349]:
     C
0_V  0
0_W  1
0_X  2
1_V  3
1_W  4
1_X  5

make sure that you use format='t' and data_columns=True (this will index save index and index all columns in the HDF5 file, allowing us to use them in the where clause) when you create/append HDF5 files:

store = pd.HDFStore('d:/temp/test1.h5')
store.append('df', df, format='t', data_columns=True)
store.close()

now we can first remove those rows from the HDFStore with matching indexes:

store = pd.HDFStore('d:/temp/test1.h5')

In [345]: store.remove('df', where="index in df2.index")
Out[345]: 2

and append df2:

In [346]: store.append('df', df2, format='t', data_columns=True, append=True)

Result:

In [347]: store.get('df')
Out[347]:
     C
0_Y  1
0_Z  2
1_Y  4
1_Z  5
0_V  0
0_W  1
0_X  2
1_V  3
1_W  4
1_X  5

Upvotes: 4

Related Questions