Reputation: 2014
I have astronomical data in columns that look something like this:
M200c, M200m, dec, ra
19.4, 20.4, 1.33, 4.68
...
I need to save my data in HDF5 format so it can be fed into a script. I know the structure the HDF5 file should have from an example file that was provided. Loading it shows this structure:
import nexusformat.nexus as nx
f = nx.nxload('example_input_file.hdf5')
print(f.tree)
>>> root:NXroot
>>>   Data:NXgroup
>>>     M200c = float32(735697)
>>>     M200m = float32(735697)
>>>     dec = float32(735697)
>>>     ra = float32(735697)
Naively, I thought that I could just load my data into a pandas df, then save it to HDF5 like this:
import pandas as pd
df ... # I do some data loading and processing here and eventually...
df.to_hdf('my_data_input_file.hdf5', key='df', mode='w')
but pandas produces a very different and convoluted structure. Hence, when I feed my HDF5 input file to the script, it gives me the error KeyError: 'Unable to open object (component not found)'.
So is there a way or package with which I can copy the structure of my example HDF5 file and reproduce it when saving my data? Or can you provide a more hardcoded solution, maybe a loop through the names of all the columns that populates an empty HDF5 file? I am completely new to this format and don't know how it works. Thanks.
Upvotes: 0
Views: 767
Reputation: 8006
Yes, as you discovered, pandas uses predefined schemas when writing HDF5 data and doesn't give you much control. I answered a similar question a few days ago. You can get close with the following pandas options: key='NXroot', format='table', data_columns=True. However, you won't be able to mimic the schema exactly. See this answer for some examples of that behavior: Pandas to HDF5?
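To see exactly how a pandas-written file differs from the example file, it can help to list every dataset path in each one. The following is a minimal sketch using h5py, not something from the original post: the helper name list_datasets and the file name layout_demo.h5 are made up for illustration, and the tiny demo file only stands in for your real example file.

```python
import h5py
import numpy as np

def list_datasets(path):
    """Return {dataset_path: (shape, dtype)} for every dataset in an HDF5 file."""
    found = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, 'r') as h5f:
        h5f.visititems(visit)
    return found

# Hypothetical stand-in file with the target layout (a 'Data' group holding a dataset).
with h5py.File('layout_demo.h5', 'w') as h5f:
    h5f.create_dataset('Data/ra', data=np.zeros(3, dtype='float32'))

layout = list_datasets('layout_demo.h5')
for name, (shape, dtype) in layout.items():
    print(name, shape, dtype)
```

Running the same helper on the example input file and on the file you produced should show which paths the script expects and which ones are missing.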
Both the h5py and PyTables (aka tables) packages can be used to create an HDF5 file exactly as you desire. It's relatively easy to do with either of them once you know how to access the dataframe columns and write them to individual datasets. Since PyTables is part of the pandas HDF5 stack, it might be simpler (for you) to implement. That said, h5py is also popular. I use both packages, and like each for different reasons.
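The process is similar with either package: create the file, create the Data group, then loop over the dataframe columns and write each one as its own dataset.
Code to create a simple dataframe to use in this example: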
import pandas as pd
M200c = [ 19.4, 18.2, 11.5, 13.6, 27.1,
11.7, 15.5, 23.3, 31.1, 22.2 ]
M200m = [ 20.4, 15.7, 34.3, 18.0, 28.2,
16.5, 30.0, 24.4, 17.7, 15.9 ]
dec = [ 1.33, 1.81, 1.11, 2.15, 1.20,
1.92, 2.61, 3.22, 3.83, 4.07 ]
ra = [ 4.68, 4.81, 5.11, 5.25, 6.12,
7.92, 5.61, 3.22, 3.83, 4.07 ]
df = pd.DataFrame({'M200c': M200c, 'M200m': M200m, 'dec': dec, 'ra':ra})
Code to create the file using PyTables (tables):
import tables as tb
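with tb.open_file('file_tb.h5', 'w') as h5f:
    # Create the 'Data' group under the root
    NXgrp = h5f.create_group('/', 'Data', createparents=True)
    # Write each dataframe column as its own array in the group
    for (colName, colData) in df.items():
        h5f.create_array(NXgrp, colName, obj=colData.values)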
Code to create the file using h5py:
import h5py
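with h5py.File('file_h5py.h5', 'w') as h5f:
    # Create the 'Data' group under the root
    NXgrp = h5f.create_group('Data')
    # Write each dataframe column as its own dataset in the group
    for (colName, colData) in df.items():
        NXgrp.create_dataset(colName, data=colData.values)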
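One detail worth checking: the example file in the question stores float32 datasets, while a dataframe built from Python floats holds float64 by default. Whether the downstream script cares about that is an assumption on my part, but if it does, a sketch like the following casts each column before writing and then reads the dtypes back. The file name file_h5py_f32.h5 and the two-row dataframe are placeholders for illustration.

```python
import h5py
import numpy as np
import pandas as pd

# Small placeholder dataframe with the same column names as in the question.
df = pd.DataFrame({'M200c': [19.4, 18.2], 'M200m': [20.4, 15.7],
                   'dec': [1.33, 1.81], 'ra': [4.68, 4.81]})

with h5py.File('file_h5py_f32.h5', 'w') as h5f:
    grp = h5f.create_group('Data')
    for col_name, col_data in df.items():
        # Cast to float32 so the stored dtype matches the example file.
        grp.create_dataset(col_name, data=col_data.to_numpy(dtype=np.float32))

# Read the file back and collect the dtype of each dataset in 'Data'.
with h5py.File('file_h5py_f32.h5', 'r') as h5f:
    dtypes = {name: str(ds.dtype) for name, ds in h5f['Data'].items()}
print(dtypes)
```

If the script accepts float64, the cast is unnecessary and the h5py code above can be used unchanged.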
Upvotes: 1