mikanim

Reputation: 439

Reading an hdf5 file only after it has completely finished acquiring data

Data will be saved into hdf5 files, but the saving takes roughly 30 seconds in total for one file. Once the data is done being saved in one hdf5 file, that file will be used immediately until the next hdf5 file is done, and the process will continue like so. Is there a simple way to check whether an hdf5 file is done being written, so that it is only used after that point? The hdf5 files are roughly 10-20 MB and will all be saved in the same folder. Of course I could set a timer of some sort above 30 seconds, but I am interested in keeping the time as low as possible, which means I need to know exactly when each hdf5 file has finished acquiring data.

A couple of ideas I have:

  1. Measuring the difference in file size from one point in time to another. If there is no change then it is assumed the file is done loading.
  2. I don't know much about hdf5 files, but perhaps there is something at the end of every hdf5 file and only at the end. If that is the case, I could keep checking whether the value of that last component is there. If it is, the file must be finished.
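
Idea 1 could be sketched by polling the file size until it stops changing between two polls (a rough sketch; the poll interval, timeout, and the name wait_until_stable are my own):

```python
import os
import time

def wait_until_stable(path, poll=1.0, timeout=60):
    """Return True once the file size stops changing between two polls."""
    last_size = -1
    waited = 0
    while waited < timeout:
        try:
            size = os.path.getsize(path)
        except FileNotFoundError:
            size = -1
        # two consecutive identical, non-empty sizes -> assume writing is done
        if size == last_size and size > 0:
            return True
        last_size = size
        time.sleep(poll)
        waited += poll
    return False
```

Note this is only a heuristic: a writer that pauses longer than one poll interval would fool the check.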

Any thoughts? I would definitely appreciate any help.

Edit: My idea with the hdf5 part inside on_created:

import pathlib
import time
from typing import Callable, Union

import h5py
from watchdog.events import (DirCreatedEvent, FileCreatedEvent,
                             FileSystemEventHandler)

class CustomHandler(FileSystemEventHandler):

    def __init__(self, callback: Callable):
        # store callback to be called on every on_created event
        self.callback = callback

    def on_created(self, event: Union[DirCreatedEvent, FileCreatedEvent]):
        # check if it's a file creation, not a directory creation
        if isinstance(event, FileCreatedEvent):
            file = pathlib.Path(event.src_path)

            wait = 3
            max_wait = 30
            waited = 0

            while True:
                try:
                    # open (and immediately close) to test availability
                    h5py.File(file, 'r').close()
                    # call the callback only once the file can be opened
                    return self.callback(file)

                except FileNotFoundError:
                    print('Error: HDF5 File not found')
                    return None

                except OSError:
                    if waited < max_wait:
                        print(f'Error: HDF5 File locked, sleeping {wait} seconds...')
                        time.sleep(wait)
                        waited += wait
                    else:
                        print(f'Waited too long = {waited} secs')
                        return None

Upvotes: 0

Views: 1534

Answers (2)

kcw78

Reputation: 8006

Based on your comments and our discussion, the easiest implementation might be a function that "waits" for the file but does not return the h5py file object. That way you still use the standard context manager (e.g., with h5py.File() as h5f:) and avoid the need to close the file in the main program.

I am posting the modified function as a new answer (renamed to h5_wait) to avoid confusion (my first answer has the original function, h5_open_wait). This function is similar, but returns a True/False flag instead of an h5py file object. It checks the file status by calling h5py.File(), then closes the file before exiting the function. It also uses sys.argv to get the HDF5 filename (as sys.argv[1]).

See new code below:

import h5py
import sys
import time

def h5_wait(h5file):
    
    wait = 3
    max_wait = 30
    waited = 0

    while True:
        try:
            h5f = h5py.File(h5file,'r')
            break
                
        except FileNotFoundError:
            print('\nError: HDF5 File not found\n')
            return False
        
        except OSError:   
            if waited < max_wait:
                print(f'Warning: HDF5 File locked, sleeping {wait} seconds...')
                time.sleep(wait) 
                waited += wait  
            else:
                print(f'\nWaited too long= {waited} secs, exiting...\n')
                return False

    h5f.close()
    return True

####################

if len(sys.argv) != 2:
    sys.exit('Include HDF5 file name on command line.')
h5file = sys.argv[1]         

h5stat = h5_wait(h5file)
if h5stat is False:
    sys.exit('Error: HDF5 File not available')
    
with h5py.File(h5file) as h5f:
    # do something with the file      
    start = time.time()
    for ds, obj in h5f.items():
        print(f'ds name={ds}; shape={obj.shape}')
      
    print(f'\nTime to read {len(list(h5f.keys()))} datasets = {time.time()-start:.2f} secs')  

Upvotes: 1

kcw78

Reputation: 8006

What you want is 'file locking'. The good news: this is enabled by default in HDF5 library builds. AND, better yet, it is enabled in the h5py package! So, you will get an exception if you try to open a file that is open for writing by another program. We can use that exception to our advantage. The challenge is differentiating the file-locked exception from other potential file-open exceptions (like the file doesn't exist).

Frankly, I prefer Python's with/as: context manager to open files. However, it handles all exceptions the same way (the file doesn't open, and the program exits). So, we need a way to handle different exceptions differently. I suspect a custom file context manager is the most Pythonic way to do this; however, that goes beyond my expertise.
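For what it's worth, such a context manager could look roughly like this (just a sketch; the name h5_open_retry and the retry parameters are made up, not a vetted implementation):

```python
import time
from contextlib import contextmanager

import h5py

@contextmanager
def h5_open_retry(h5file, wait=3, max_wait=30):
    """Open an HDF5 file read-only, retrying while it is locked."""
    waited = 0
    while True:
        try:
            h5f = h5py.File(h5file, 'r')
            break
        except FileNotFoundError:
            raise
        except OSError:
            # assume a lock; retry until the time limit, then re-raise
            if waited >= max_wait:
                raise
            time.sleep(wait)
            waited += wait
    try:
        yield h5f
    finally:
        h5f.close()
```

Used as with h5_open_retry('some_file.h5') as h5f:, the file is closed automatically even if the body raises.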

Instead, I wrote a function you call with the filename. It uses try/except: in a while loop to open the file. One of 3 things will happen:

  1. It returns the h5py file object if it opens the file.
  2. It immediately returns None if the file doesn't exist.
  3. If it is locked, it sleeps, then retries again. If it can't open after the time limit, None is returned.

Remember to use the .close() method when using this function!

Code updated 2021-09-09 to pass the HDF5 file name as a required command line argument using the argparse module. Updated code below:

import h5py
import argparse
import sys 
import time

def h5_open_wait(h5file):
    
    wait = 3
    max_wait = 30
    waited = 0

    while True:
        try:
            h5f = h5py.File(h5file,'r')
            return h5f
                
        except FileNotFoundError:
            print('Error: HDF5 File not found')
            return None
        
        except OSError:   
            if waited < max_wait:
                print(f'Error: HDF5 File locked, sleeping {wait} seconds...')
                time.sleep(wait) 
                waited += wait  
            else:
                print(f'waited too long= {waited} secs')
                return None

def get_job_options():
    # Note that the HDF5 file name is the only parameter, and it is required
    parser = argparse.ArgumentParser(description='Check HDF5 file is available to open.')
    parser.add_argument('hdf5', help='HDF5 filename (Required)')

    if len(sys.argv) == 1:
        # display help message when no args are passed
        parser.print_help()
        sys.exit('Error: No HDF5 file name specified; exiting.')

    args = parser.parse_args()

    HDF5_FILE = args.hdf5
    #print('HDF5 file = %s' % args.hdf5)

    return HDF5_FILE

####################

h5file  = get_job_options()

start = time.time()

h5f = h5_open_wait(h5file)
if h5f is None:
    sys.exit('Error: HDF5 File not opened')
    
# do something with the file      
for ds, obj in h5f.items():
    print(f'ds name={ds}; shape={obj.shape}')

h5f.close()     
print(f'\nTime to read all datasets = {time.time()-start:.2f} secs')  

To test, I wrote a simple program that creates 800 datasets from a large array (code below). Start it first, then run the code above to see how it waits. Adjust max_wait above, and a0 and cnt below, as appropriate for your system speed.

Code to create the example file used above:

import time

import h5py
import numpy as np

start = time.time()
a0 = 1000
cnt = 800
arr = np.random.random(a0*a0).reshape(a0,a0)
with h5py.File('SO_69067142.h5','w') as h5f:
    for dcnt in range(cnt):
        h5f.create_dataset(f'ds_{dcnt:03}',data=arr)

print(f'Time to create {cnt} datasets={time.time()-start:.2f}')

Upvotes: 1
