Reputation: 5477
I have a pandas DataFrame with a MultiIndex, created like this:
import numpy as np
import pandas as pd
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=mindex)
store = pd.HDFStore("df.h5")
store["df"] = df
store.close()
I would like to add attributes to df stored in the HDFStore. How can I do this? There doesn't seem to be any documentation regarding the attributes, and the group that is used to store the df is not of the same type as the HDF5 Group in the h5py module:
type(list(store.groups())[0])
Out[24]: tables.group.Group
It seems to be a PyTables group, and the only relevant method I could find on it is this special method, which concerns a different kind of attribute (a plain Python attribute rather than an HDF5 one):
__setattr__(self, name, value)
| Set a Python attribute called name with the given value.
What I would like is to simply store a bunch of MultiIndex DataFrames that are "marked" by attributes in a structured way, so that I can compare them and sub-select them based on those attributes.
Basically, what HDF5 is meant to be used for, combined with MultiIndex DataFrames from pandas.
There are questions like this one that deal with reading HDF5 files with readers other than pandas, but they all have DataFrames with one-dimensional indices, which makes it easy to simply dump NumPy ndarrays and store the index separately.
Upvotes: 4
Views: 3153
Reputation: 8583
Adding attributes to a group from within pandas seems to be possible by now (I could not find out since which release; the snippet below was tested with pandas 1.4.2 and Python 3.10.4). According to the pandas HDF cookbook, the following approach can be used:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(8, 3))
store = pd.HDFStore("test.h5")
store.put("df", df)
store.get_storer("df").attrs.my_attribute = {"A": 10}
store.close()
HDFStore can also be used as a context manager:
with pd.HDFStore("test.h5") as store:
    store.put("df", df)
    store.get_storer("df").attrs.my_attribute = {"A": 10}
Note that the attribute's name can be chosen freely (data_origin in the following) and that its value does not have to be a dictionary:
store.get_storer("df").attrs.data_origin = 'random data generation'
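To get back to what the question asks for (sub-selecting stored DataFrames by their attributes), the same get_storer(...).attrs object can be read as well as written. The following is only a minimal sketch under some assumptions: the keys run_a and run_b, the attribute name meta, and the helper select_by_attribute are made up for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical example data: two MultiIndex frames tagged with different metadata.
iterables = [["bar", "baz"], ["one", "two"]]
mindex = pd.MultiIndex.from_product(iterables, names=["first", "second"])
frames = {
    "run_a": (pd.DataFrame(np.random.randn(4, 2), index=mindex), {"number": 1}),
    "run_b": (pd.DataFrame(np.random.randn(4, 2), index=mindex), {"number": 2}),
}

path = os.path.join(tempfile.mkdtemp(), "tagged.h5")

# Write every frame and attach its metadata dict as an attribute of its group.
with pd.HDFStore(path) as store:
    for key, (frame, meta) in frames.items():
        store.put(key, frame)
        # "meta" is an arbitrary attribute name chosen for this sketch.
        store.get_storer(key).attrs.meta = meta


def select_by_attribute(path, name, value):
    """Return {key: DataFrame} for stored frames whose meta[name] equals value."""
    selected = {}
    with pd.HDFStore(path, mode="r") as store:
        for key in store.keys():
            # Frames stored without a "meta" attribute are treated as untagged.
            meta = getattr(store.get_storer(key).attrs, "meta", {})
            if meta.get(name) == value:
                selected[key] = store[key]
    return selected


selected = select_by_attribute(path, "number", 1)
print(list(selected))
```

Only the frames whose attribute matches are actually loaded; the others are skipped after their attributes have been inspected.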
Upvotes: 1
Reputation: 5477
I haven't gotten any answers so far, so this is what I managed to do using both the pandas and the h5py modules: pandas is used to store and read the MultiIndex DataFrame, and h5py to store and read the attributes of the HDF5 group:
import numpy as np
import pandas as pd
import h5py

# Create a random DataFrame with a two-level MultiIndex
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 4), index=mindex)

# Dump the data with pandas; the store is closed before h5py opens the file,
# so that only one library has the file open at any time
with pd.HDFStore("df.h5") as pdStore:
    pdStore["df"] = df

# Store the attribute with h5py; mode "a" opens the existing file read/write
with h5py.File("df.h5", "a") as h5pyFile:
    h5pyFile["/df"].attrs["number"] = 1

# Read the data conditionally, based on the stored attribute
readDf = pd.DataFrame()
with h5py.File("df.h5", "r") as h5pyFile:
    matches = h5pyFile["/df"].attrs["number"] == 1
if matches:
    readDf = pd.read_hdf("df.h5", "df")
print(readDf - df)
I still don't know whether there are any issues with having both h5py and pandas handle the .h5 file simultaneously.
Upvotes: 3