Reputation: 213
I'm processing a large number of files in Python and need to write the output (one DataFrame for each input file) to HDF5
directly.
I am wondering what the best way is to write a pandas
DataFrame from my script to HDF5
directly and quickly. I am not sure whether any Python module like hdf5 or hadoopy can do this. Any help in this regard will be appreciated.
Upvotes: 3
Views: 2947
Reputation: 210812
It's difficult to give you a good answer to this rather generic question.
It's not clear how you are going to use (read) your HDF5 files - do you want to select data conditionally (using the where
parameter)?
First of all, you need to open a store object:
import pandas as pd

store = pd.HDFStore('/path/to/filename.h5')
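As a variation on the same API, the compression settings used below can instead be passed once when opening the store, so that everything written to it inherits them:

store = pd.HDFStore('/path/to/filename.h5', complevel=5, complib='blosc')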
Now you can write (or append) to the store. I'm using blosc
compression here - it's pretty fast and efficient. Besides that, I will use the data_columns
parameter to specify the columns that must be indexed, so you can use these columns in the where
parameter later, when you read your HDF5 file:
for f in files:
    # read or process each file into a separate `df`
    store.append('df_identifier_AKA_key', df,
                 data_columns=list_of_indexed_cols,
                 complevel=5, complib='blosc')
store.close()
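When you read the file later, the indexed columns can be used in the where parameter to select rows conditionally. A minimal sketch (some_indexed_col and the threshold are placeholders, not columns from your data):

store = pd.HDFStore('/path/to/filename.h5')
# select only the rows matching the condition, without loading the whole file
df = store.select('df_identifier_AKA_key', where='some_indexed_col > 100')
store.close()

or in one call:

df = pd.read_hdf('/path/to/filename.h5', 'df_identifier_AKA_key',
                 where='some_indexed_col > 100')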
Upvotes: 2