Reputation: 294218
I am going to be iterating through many data frames and appending each one to a table in an HDF5 store. Their indices will overlap with each other. I want to append only the rows whose index values aren't already in the store.
Consider my data frames d1 and d2:
import pandas as pd

d1 = pd.DataFrame.from_dict(
    {('a', 'x'): {'col': 1}, ('a', 'y'): {'col': 1}}, orient='index')
d2 = pd.DataFrame.from_dict(
    {('b', 'x'): {'col': 2}, ('a', 'y'): {'col': 2}}, orient='index')
print(d1, '\n\n', d2)
     col
a x    1
  y    1

     col
a y    2
b x    2
I want to accomplish the same logic as the following:
d1.append(d2.loc[d2.index.difference(d1.index)])
col
a x 1
y 1
b x 2
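For readers on recent pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the line above fails there. Below is a minimal sketch of the same "append only unseen index values" logic using pd.concat. The frames are rebuilt with an explicit MultiIndex (an assumption, so that row order is predictable), and the names new_rows and combined are illustrative only.

```python
import pandas as pd

# Same data as d1 and d2 in the question, built with an explicit MultiIndex.
d1 = pd.DataFrame(
    {'col': [1, 1]},
    index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y')]))
d2 = pd.DataFrame(
    {'col': [2, 2]},
    index=pd.MultiIndex.from_tuples([('b', 'x'), ('a', 'y')]))

# Keep only the rows of d2 whose index entry is not already in d1.
new_rows = d2.loc[~d2.index.isin(d1.index)]

# pd.concat stands in for the removed DataFrame.append.
combined = pd.concat([d1, new_rows])
print(combined)
```

The boolean mask from isin keeps d2's original row order, whereas index.difference returns a sorted index; either should work for this purpose.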
But I want to do this when appending to the HDF5 store.
d1.to_hdf('test.h5', 'mytable', format='table')
d2.to_hdf('test.h5', 'mytable', append=True)
pd.read_hdf('test.h5', 'mytable')
col
a x 1
y 1
y 2
b x 2
You can see that the index ('a', 'y') is duplicated, with two different values. I'm assuming there is a way to check the index values in the table before appending new rows to it.
Upvotes: 3
Views: 507
Reputation: 16079
It might help to initialize the store first. Then you should be able to assign a dataframe to mytable and work with it just like you did in your dataframe-only example.
store = pd.HDFStore('test.h5')
store['mytable'] = d1
store['mytable'].append(d2.loc[d2.index.difference(store['mytable'].index)])
col
a x 1
y 1
b x 2
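One caveat worth checking: store['mytable'] returns an in-memory DataFrame, so calling .append on it builds a new frame rather than writing to the file, and store['mytable'] = d1 writes in the default fixed format, which cannot be appended to. Here is a hedged sketch of one way to persist only the new rows, using HDFStore.append with the table format. The helper name append_new_rows is hypothetical, the frames are rebuilt with an explicit MultiIndex, and the sketch assumes PyTables is installed and that the stored table is small enough to read back in full to get its index.

```python
import os
import tempfile

import pandas as pd


def append_new_rows(path, key, df):
    """Append to the table at `key` only rows of `df` whose index is not stored yet.

    Hypothetical helper, not part of pandas.
    """
    with pd.HDFStore(path) as store:
        if key in store:
            # Assumption: reading the whole table to get its index is affordable.
            existing = store.select(key).index
            df = df.loc[~df.index.isin(existing)]
        if not df.empty:
            # format='table' creates an appendable table on the first write.
            store.append(key, df, format='table')


d1 = pd.DataFrame(
    {'col': [1, 1]},
    index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y')]))
d2 = pd.DataFrame(
    {'col': [2, 2]},
    index=pd.MultiIndex.from_tuples([('b', 'x'), ('a', 'y')]))

# Write to a temporary file so the example does not touch an existing test.h5.
path = os.path.join(tempfile.mkdtemp(), 'test.h5')
for frame in (d1, d2):
    append_new_rows(path, 'mytable', frame)

result = pd.read_hdf(path, 'mytable')
print(result)
```

Reading the full table on every append gets expensive as the table grows. For large stores it may be better to read only the index columns or to keep the set of seen index values in memory across iterations.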
Upvotes: 1