piRSquared

Reputation: 294218

How do I append only the rows with new indices to a table in an HDF5 store?

I am going to iterate through many data frames and append each one to a table in an HDF5 store (pandas HDFStore). Their indices will overlap. I want to append only the rows whose indices aren't already in the store.


MCVE

Consider my data frames d1 and d2:

import pandas as pd

d1 = pd.DataFrame.from_dict(
    {('a', 'x'): {'col': 1}, ('a', 'y'): {'col': 1}}, orient='index')
d2 = pd.DataFrame.from_dict(
    {('b', 'x'): {'col': 2}, ('a', 'y'): {'col': 2}}, orient='index')

print(d1, '\n\n', d2)

     col
a x    1
  y    1 

      col
a y    2
b x    2

I want to accomplish the same logic as the following:

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([d1, d2.loc[d2.index.difference(d1.index)]])

     col
a x    1
  y    1
b x    2

But I want this behavior when appending to the HDF5 store.

What I've Tried

d1.to_hdf('test.h5', key='mytable', format='table')
d2.to_hdf('test.h5', key='mytable', append=True)

pd.read_hdf('test.h5', 'mytable')

     col
a x    1
  y    1
  y    2
b x    2

You can see that the index ('a', 'y') now appears twice, with two different values. I assume there is a way to check the index values already in the table before appending new rows.

Upvotes: 3

Views: 507

Answers (1)

Grr

Reputation: 16079

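It might help to open the store first. You can then write d1 to mytable in table format, pick out the rows of d2 whose index is not yet stored (just like in your DataFrame-only example), and write only those rows with store.append. Note that calling .append on the DataFrame returned by store['mytable'] would only build a new in-memory frame and leave the file unchanged.

store = pd.HDFStore('test.h5')

store.put('mytable', d1, format='table')  # table format allows appending
new_rows = d2.loc[d2.index.difference(store['mytable'].index)]
store.append('mytable', new_rows)  # writes only the unseen rows
store['mytable']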

     col
a x    1
  y    1
b x    2
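
Since the question is about looping over many data frames, the filter-then-append step could be wrapped in a small helper. The following is only a sketch under a few assumptions: `append_new_rows` is a hypothetical name (not a pandas function), PyTables is installed, and the stored table is small enough to read back in full on every call. It filters with a boolean mask from `Index.isin` rather than `index.difference`, which keeps the incoming row order and also handles the case where nothing is new.

```python
import os
import tempfile

import pandas as pd


def append_new_rows(store, key, df):
    """Append only the rows of df whose index is not yet stored under key.

    Hypothetical helper, not part of pandas. It reads the whole stored
    table to obtain its index, which is simple but not cheap for big tables.
    Returns the number of rows actually written.
    """
    if key in store:
        stored_index = store.select(key).index
        # Keep rows whose index label is absent from the stored table
        df = df[~df.index.isin(stored_index)]
    if not df.empty:
        # HDFStore.append writes in table format, so later appends work too
        store.append(key, df)
    return len(df)


# Same data as d1 and d2 in the question, built with an explicit MultiIndex
d1 = pd.DataFrame(
    {'col': [1, 1]},
    index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y')]))
d2 = pd.DataFrame(
    {'col': [2, 2]},
    index=pd.MultiIndex.from_tuples([('a', 'y'), ('b', 'x')]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.h5')
    with pd.HDFStore(path) as store:
        # d2 is passed twice to show that a repeated frame adds nothing
        counts = [append_new_rows(store, 'mytable', frame)
                  for frame in (d1, d2, d2)]
        result = store['mytable']

print(counts)
print(result)
```

If the table grows large, re-reading it on every iteration may become the bottleneck; one option would be to keep the known index in memory and update it after each append instead of calling `store.select` each time.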

Upvotes: 1
