How can I save a dataset to the correct HDF5 group?

Question

I am running a script that iterates through a folder and creates a set of groups that replicate the directory structure in an HDF5 file. I then go through the files and add their data to HDF5 datasets. I want to save each dataset into the group that matches its folder, but I am unsure how to do that.

#Development Script Save files in groups
#Create HDF5 File name
TestFilename = 'N:/TestingPyhonHDF5/progress/automation/GroupTesting.h5'
#Set target directory to extract data 
TargetFolder = 'N:\Measurements\T2+\Rx-BB001'
# giving file extensions
bin = ('.bin')
csv = ('.CSV')
tmp = ('.tmp')
head = ('.head')
ext = ('.bin', '.head', '.tmp', '.CSV')

# Create HDF5 Structure
with h5py.File(TestFilename,'w') as tf:
    for root, dirs, _ in os.walk(TargetFolder, topdown=True):
        #print(f'ROOT: {root}')
        # for Windows, modify root: remove drive letter and replace backslashes:
        grp_name = root[2:].replace('\\', '/')
        #print(f'grp_name: {grp_name}\n')
        tf.create_group(grp_name)

#Open HDF5 file
with h5py.File(TestFilename,'a') as tfile:
#Iterate files to send to HDF5 file
    for path, dirc, files in os.walk(TargetFolder):
        for file in files:
            if file.endswith(bin):
                # Create a dtype with the binary data format and the desired column names
                filePath = os.path.join(path, file)
                dt = np.dtype('B')
                data = np.fromfile(filePath, dtype=dt)
                df = pd.DataFrame(data)

                #Save as csv
                savetxt('TempData.csv', df, delimiter=',')

                #Read bin to HDF5
                dfBIN = pd.read_csv('TempData.csv')  
                tfile.create_dataset(grp_name/file, data=dfBIN) #put data in hdf file
                #add attrs
                os.remove("TempData.csv")
            else:
                continue

Currently the code raises this error:

TypeError                                 Traceback (most recent call last)
Cell In [51], line 39
     37 #Read bin to HDF5
     38 dfBIN = pd.read_csv('TempData.csv')  
---> 39 tfile.create_dataset(grp_name/file, data=dfBIN) #put data in hdf file
     40 #add attrs
     41 os.remove("TempData.csv")

TypeError: unsupported operand type(s) for /: 'str' and 'str'

AKX · Accepted Answer

The / operator only joins pathlib.Path objects, and grp_name and file are both plain strings.

Just do

f"{grp_name}/{file}"

or

posixpath.join(grp_name, file)

(posixpath to ensure forward slashes no matter which platform you're on)
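
To see the difference side by side, here is a small sketch that uses only the standard library. The group and file names are made-up placeholders, not values taken from the question.

```python
import posixpath
from pathlib import PurePosixPath

grp_name = "/Measurements/T2+/Rx-BB001"  # hypothetical group name
file = "example.bin"                     # hypothetical file name

# str / str is not defined, which is the TypeError from the question
try:
    grp_name / file
except TypeError as exc:
    error_message = str(exc)

# Two equivalent ways to build the dataset path as a plain string
via_fstring = f"{grp_name}/{file}"
via_posixpath = posixpath.join(grp_name, file)

# The / operator does work once the left operand is a path object
via_pathlib = str(PurePosixPath(grp_name) / file)

print(via_fstring)
```

All three variants yield the same string, so which one to use is mostly a matter of taste.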

You will also naturally need to do the same grp_name determination in the second loop:

grp_name = path[2:].replace('\\', '/')

Otherwise you're using the last value of grp_name from the earlier loop.
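
If the same conversion is needed in more than one place, it could be pulled into a small function. The sketch below is one possible way to do that. The name hdf5_group_name is hypothetical, and it uses ntpath.splitdrive instead of slicing off the first two characters, which is a swapped-in variation rather than what the question's code does.

```python
import ntpath


def hdf5_group_name(windows_path):
    """Turn a Windows directory path into an HDF5-style group name (hypothetical helper)."""
    # splitdrive separates "N:" from the rest, so no fixed-length slicing is needed
    _drive, rest = ntpath.splitdrive(windows_path)
    # HDF5 group paths use forward slashes
    return rest.replace("\\", "/")


print(hdf5_group_name(r"N:\Measurements\T2+\Rx-BB001"))
# prints /Measurements/T2+/Rx-BB001
```

Because ntpath is importable on every platform, this behaves the same whether the script runs on Windows or elsewhere.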

All in all, you may just want to go for a single loop. Since h5py's create_group raises an error when the group already exists, I added a set that keeps track of the groups already created. Also, the intermediate CSV file seemed quite extraneous, so I dropped it.

import os
import posixpath
import pandas as pd
import numpy as np

import h5py

TestFilename = "N:/TestingPyhonHDF5/progress/automation/GroupTesting.h5"
TargetFolder = r"N:\Measurements\T2+\Rx-BB001"  # raw string so the backslashes are kept literally

groups_created = set()

with h5py.File(TestFilename, "a") as tfile:
    for path, dirc, files in os.walk(TargetFolder):
        for file in files:
            if file.endswith(".bin"):
                grp_name = path[2:].replace("\\", "/")
                if grp_name not in groups_created:
                    tfile.create_group(grp_name)
                    groups_created.add(grp_name)
                # Read the file as raw unsigned bytes
                filePath = os.path.join(path, file)
                dt = np.dtype("B")
                data = np.fromfile(filePath, dtype=dt)
                df = pd.DataFrame(data)
                # Store the data under the group that mirrors the file's folder
                tfile.create_dataset(posixpath.join(grp_name, file), data=df)
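
On the question of re-creating groups: as far as I know, create_group raises a ValueError when the name already exists, while require_group returns the existing group. create_dataset also creates any missing intermediate groups on its own. The sketch below tries this on an in-memory file, so nothing is written to disk. The file, group, and dataset names are placeholders, and the behaviour may differ in older h5py versions.

```python
import h5py
import numpy as np

# In-memory file (core driver without a backing store); names are placeholders
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    f.create_group("Measurements/T2+")
    try:
        f.create_group("Measurements/T2+")  # second attempt on the same name
        recreate_failed = False
    except ValueError:
        recreate_failed = True

    # require_group returns the existing group instead of raising
    grp = f.require_group("Measurements/T2+")

    # create_dataset builds the missing intermediate group (Rx-BB001) itself
    data = np.arange(4, dtype="B")
    f.create_dataset("Measurements/T2+/Rx-BB001/example.bin", data=data)

    stored = f["Measurements/T2+/Rx-BB001/example.bin"][:]
    group_exists = "Measurements/T2+/Rx-BB001" in f

print(recreate_failed, group_exists, stored.tolist())
```

Note that a set kept in memory knows nothing about groups left over from an earlier run, so when the file is opened in "a" mode, require_group (or relying on create_dataset alone) may be the safer choice.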
