Mahesh

Reputation: 1016

How to create datasets within a group in hdf5 file?

I want to create a group with the path "particles/lipids/positions" that contains the datasets: particles is the main group, lipids holds the lipid names, and positions holds the positions of each lipid in every frame. I tried the code below (adapted from a previous answer), but I get the following error at line 40 of the script:

 ValueError: Unable to create group (name already exists)

import struct
import numpy as np
import h5py

csv_file = 'com'

fmtstring = '7s 8s 5s 7s 7s 7s'
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from

#define a np.dtype for gro array/dataset (hard-coded for now)
gro_dt = np.dtype([('col1', 'S7'), ('col2', 'S8'), ('col3', int), 
                   ('col4', float), ('col5', float), ('col6', float)])

with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:
         
    step = 0
    while True:         
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline())
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            fields = parse( f.readline().encode('utf-8') )
            arr[row]['col1'] = fields[0].strip()            
            arr[row]['col2'] = fields[1].strip()            
            arr[row]['col3'] = int(fields[2])
            arr[row]['col4'] = float(fields[3])
            arr[row]['col5'] = float(fields[4])
            arr[row]['col6'] = float(fields[5])
        if arr.shape[0] > 0:
            # Create a group to store positions
            particles_grp = hdf.create_group('particles/lipids/positions')
            # create a dataset for THIS time step
            ds = particles_grp.create_dataset(f'dataset_{step:04}', data=arr, compression='gzip')
            #ds = hdf.create_dataset(f'dataset_{step:04}', data=arr, compression='gzip')
            # create attributes for this dataset / time step
            hdr_tokens = header.split()
            particles_grp['ds'] = ds
            ds.attrs['raw_header'] = header
            #ds.attrs['Generated by'] = hdr_tokens[2]
            #ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        footer = f.readline()
        step += 1

The small data file is linked here data file. In the present code, each frame is stored in dataset_0000, dataset_0001, and so on. I want these datasets to be stored in the particles group. I'm not sure this is the best layout, because I want to use these frames for further calculations later. Thanks!

Upvotes: 0

Views: 3530

Answers (2)

kcw78

Reputation: 8081

As noted in the previous answer, you try to create the same group inside the while loop with this call:
particles_grp = hdf.create_group('particles/lipids/positions')
You get an error the second time you call it (because the group already exists).

Instead, use this function to create the group object:
particles_grp = hdf.require_group('particles/lipids/positions')

require_group() is smart (and useful). If the group doesn't exist, it creates it; if the group already exists, it simply returns the existing group object.
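A minimal sketch of the difference (the file path here is illustrative):

```python
import os
import tempfile

import h5py

# Sketch (illustrative file path): create_group() fails the second time it is
# called with the same path, while require_group() returns the existing group.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(path, "w") as hdf:
    grp1 = hdf.create_group("particles/lipids/positions")   # first call: OK
    grp2 = hdf.require_group("particles/lipids/positions")  # same group back
    same_group = grp1.name == grp2.name                     # True

    raised = False
    try:
        hdf.create_group("particles/lipids/positions")      # duplicate path
    except ValueError:
        raised = True                                       # "name already exists"

print("same group:", same_group, "| duplicate create raised:", raised)
```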

Make that change to your code and it will work with no other changes.

Alternatively, you can move the create_group() call ABOVE the while True: loop (so it is only called once).

Upvotes: 1

Jeremy Savage

Reputation: 894

You are running the line of code:

particles_grp = hdf.create_group('particles/lipids/positions')

inside your while loop. This means you are trying to create the group inside the HDF5 file more than once, which is not possible (as the name is hard-coded). Try something like this:

with open(csv_file, 'r') as f, \
     h5py.File('xaa.h5', 'w') as hdf:
    # Create a group to store positions
    particles_grp = hdf.create_group('particles/lipids/positions')
    step = 0
    while True:         
        header = f.readline()
        if not header:
            print("End Of File")
            break
        else:
            print(header)

        # get number of data rows
        no_rows = int(f.readline())
        arr = np.empty(shape=(no_rows,), dtype=gro_dt)
        for row in range(no_rows):
            fields = parse( f.readline().encode('utf-8') )
            arr[row]['col1'] = fields[0].strip()            
            arr[row]['col2'] = fields[1].strip()            
            arr[row]['col3'] = int(fields[2])
            arr[row]['col4'] = float(fields[3])
            arr[row]['col5'] = float(fields[4])
            arr[row]['col6'] = float(fields[5])
        if arr.shape[0] > 0:
            # create a dataset for THIS time step
            ds = particles_grp.create_dataset(f'dataset_{step:04}', data=arr, compression='gzip')
            #ds = hdf.create_dataset(f'dataset_{step:04}', data=arr, compression='gzip')
            # create attributes for this dataset / time step
            hdr_tokens = header.split()
            particles_grp['ds'] = ds
            ds.attrs['raw_header'] = header
            #ds.attrs['Generated by'] = hdr_tokens[2]
            #ds.attrs['P/L'] = hdr_tokens[4].split('=')[1]
            ds.attrs['Time'] = hdr_tokens[6]
            
        footer = f.readline()
        step += 1

I assume this is the issue, judging from the error message; give this a go and let me know if it works.

HDF5 uses a hierarchical file structure similar to your file system. Imagine trying to create two directories (folders) with the same name in the same place: you can only have one. So create the group (the folder) once, then put the datasets (the files) inside it.

EDIT: it looks like you are going to run into a further issue here:

particles_grp['ds'] = ds

This line tries to create a second link named ds in the group on every pass through the loop, so it fails the second time for the same reason (the name already exists). Since create_dataset() already stores each dataset in the group under a unique name (dataset_0000, dataset_0001, ...), you can simply delete this line.
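For the follow-up calculations mentioned in the question, one way to get the frames back is to iterate over the group; the zero-padded names (dataset_0000, dataset_0001, ...) sort in step order. A small sketch with made-up data standing in for the parsed rows:

```python
import os
import tempfile

import h5py
import numpy as np

# Write one dataset per time step (dummy data), then read all frames back.
path = os.path.join(tempfile.mkdtemp(), "frames.h5")

with h5py.File(path, "w") as hdf:
    grp = hdf.create_group("particles/lipids/positions")
    for step in range(3):
        # each "frame" here is just a dummy array standing in for the parsed rows
        grp.create_dataset(f"dataset_{step:04}", data=np.full(4, float(step)))

with h5py.File(path, "r") as hdf:
    grp = hdf["particles/lipids/positions"]
    # sorted() keeps the frames in step order thanks to the zero-padded names
    means = {name: float(grp[name][:].mean()) for name in sorted(grp)}

print(means)  # {'dataset_0000': 0.0, 'dataset_0001': 1.0, 'dataset_0002': 2.0}
```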

Upvotes: 1
