mrks

Reputation: 311

hdf to ndarray in numpy - fast way

I am looking for a fast way to load my collection of hdf files into a numpy array where each row is a flattened version of an image. What exactly I mean:

My hdf files store, among other information, images per frame. Each file holds 51 frames of 512x424 images. Now I have 300+ hdf files and I want the image pixels to be stored as one single vector per frame, with all frames of all files stored in one numpy ndarray. The following picture should help to understand:

Visualized process of transforming many hdf files to one numpy array

What I have so far is a very slow method, and I have no idea how to make it faster. The problem, I think, is that my final array is rebuilt too often: I observe that the first files are loaded into the array very quickly, but the speed then drops sharply (observed by printing the number of the current hdf file).

My current code:

import os
import glob

import h5py
import numpy as np

os.chdir(os.getcwd()+"\\datasets")

# predefine first row to use vstack later
numpy_data = np.ndarray((1,217088))

# search for all .hdf files
for idx, file in enumerate(glob.glob("*.hdf5")):
  f = h5py.File(file, 'r')
  # load all img data to imgs (=ndarray, but not flattened)
  imgs = f['img']['data'][:]

  # iterate over all frames (50)
  for frame in range(0, imgs.shape[0]):
    print("processing {}/{} (file/frame)".format(idx+1,frame+1))
    data = np.array(imgs[frame].flatten())
    numpy_data = np.vstack((numpy_data, data))

    # delete the predefined first row once the first real row is stored
    if idx == 0 and frame == 0:
        numpy_data = np.delete(numpy_data, 0,0)

  f.close()

For further information: I need this for learning a decision tree. Since my hdf data is bigger than my RAM, I think converting it into a numpy array saves memory and is therefore better suited.

Thanks for every input.

Upvotes: 0

Views: 1801

Answers (2)

max9111

Reputation: 6482

Do you really want to load all images into RAM rather than use a single HDF5 file? Accessing a HDF5 file can be quite fast if you don't make any mistakes (unnecessary fancy indexing, improper chunk-cache-size). If you want the numpy way, this would be a possibility:

os.chdir(os.getcwd()+"\\datasets")
img_per_file=51

# get all HDF5-Files
files = glob.glob("*.hdf5")

# allocate memory for your final Array (change the datatype if your images have some other type)
numpy_data=np.empty((len(files)*img_per_file,217088),dtype=np.uint8)

# Now read all the data
ii=0
for i in range(0,len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    numpy_data[ii:ii+img_per_file,:]=imgs.reshape((img_per_file,217088))
    ii=ii+img_per_file

Writing your data to a single HDF5-File would be quite similar:

f_out=h5py.File(File_Name_HDF5_out,'w')
# create the dataset (change the datatype if your images have some other type)
dset_out = f_out.create_dataset(Dataset_Name_out, (len(files)*img_per_file, 217088), chunks=(1,217088), dtype='uint8')

# Now read all the data
ii=0
for i in range(0,len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    dset_out[ii:ii+img_per_file,:]=imgs.reshape((img_per_file,217088))
    ii=ii+img_per_file

f_out.close()

If you only want to access whole images afterwards, the chunk size should be okay. If not, you have to change it to fit your needs.
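
For example, a minimal sketch (not part of the original answer) of reading a single flattened image back from the output file, using the file and dataset names from above:

# With chunks=(1,217088) every row is exactly one chunk, so this is a single chunk read.
with h5py.File(File_Name_HDF5_out, 'r') as f_out:
    dset = f_out[Dataset_Name_out]
    first_image = np.squeeze(dset[0:1, :]).reshape(512, 424)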

What you should do when accessing a HDF5-File:

  • Use a chunk size that fits your needs.

  • Set a proper chunk-cache-size. This can be done with the h5py low-level API or with h5py_cache (https://pypi.python.org/pypi/h5py-cache/1.0); see the sketch after this list.

  • Avoid any type of fancy indexing. If your dataset has n dimensions, access it in a way that the returned array also has n dimensions.

    # Chunk size is [50,50] and we iterate over the first dimension
    numpyArray=h5_dset[i,:] #slow
    numpyArray=np.squeeze(h5_dset[i:i+1,:]) #does the same but is much faster
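
A minimal sketch (not part of the original answer) of raising the raw data chunk cache through the h5py low-level API; the 100 MB value and the file name are only illustrative:

import h5py

# Hypothetical values: enlarge the chunk cache before opening the file.
propfaid = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
settings = list(propfaid.get_cache())
settings[2] = 100 * 1024**2                  # raw data chunk cache size in bytes
propfaid.set_cache(*settings)

fid = h5py.h5f.open(b'some_file.hdf5', flags=h5py.h5f.ACC_RDONLY, fapl=propfaid)
f = h5py.File(fid)                           # wrap the low-level id in a high-level File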
    

EDIT This shows how to read your data into a memory-mapped numpy array. I think your method expects data of format np.float32. https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap

numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+', shape=(len(files)*img_per_file, 217088))

Everything else could be kept the same. If it works, I would also recommend using an SSD instead of a hard disk.
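
As a hedged follow-up (not in the original answer), the same memmap can later be reopened read-only, as long as the dtype and shape are passed again:

# Hedged sketch: reopen the memmap for training without copying it into RAM.
numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='r',
                       shape=(len(files)*img_per_file, 217088))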

Upvotes: 1

hpaulj

Reputation: 231385

I don't think you need to iterate over

imgs = f['img']['data'][:]

and reshape each 2d array. Just reshape the whole thing. If I understand your description right, imgs is a 3d array: (51, 512, 424)

imgs.reshape(51, 512*424)

should be the 2d equivalent.
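
As a quick sanity check (my addition, with a random block standing in for real data), reshaping the whole 3d array yields the same rows as flattening each frame individually:

import numpy as np

# Hedged check: whole-array reshape vs. per-frame flatten give identical rows.
imgs = np.random.randint(0, 256, size=(51, 512, 424), dtype=np.uint8)
assert np.array_equal(imgs.reshape(51, 512*424)[0], imgs[0].flatten())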

If you must loop, don't use vstack (or some variant) to build a bigger array. For one, it is slow, and two, it's a pain to clean up the initial 'dummy' entry. Use list appends, and do the stacking once, at the end:

alist = []
for frame in range(imgs.shape[0]):
    alist.append(imgs[frame].flatten())
data_array = np.vstack(alist)

vstack (and family) takes a list of arrays as input, so it can work with many at once. List append is much faster when done iteratively.
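
Putting both points together, a hedged sketch (mine, not the answer's code verbatim) that reshapes each file's 3d block once and stacks all blocks a single time at the end:

import glob
import h5py
import numpy as np

# Collect one 2d block per file, then stack once at the end.
blocks = []
for fname in glob.glob("*.hdf5"):
    with h5py.File(fname, 'r') as f:
        imgs = f['img']['data'][:]                  # (51, 512, 424)
    blocks.append(imgs.reshape(imgs.shape[0], -1))  # (51, 217088)
numpy_data = np.vstack(blocks)                      # (n_files*51, 217088)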

I question whether putting things into one array will help. I don't know exactly how the size of a hdf5 file relates to the size of the loaded array, but I expect they are in the same order of magnitude. So trying to load all 300 files into memory might not work. That's what, 3G of pixels?

For an individual file, h5py has provision for loading chunks of an array that is too large to fit in memory. That indicates the problem often goes the other way: the file holds more than fits in memory.
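
For illustration, a minimal sketch (not from the answer) of reading one frame at a time via slicing, so the dataset never has to fit in RAM; the file name is hypothetical:

import h5py

# Hedged sketch: slice the on-disk dataset instead of loading it all at once.
with h5py.File('some_file.hdf5', 'r') as f:
    dset = f['img']['data']          # no pixel data is read yet
    for i in range(dset.shape[0]):
        frame = dset[i:i+1, :, :]    # reads only this frame from the file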

Is it possible to load large data directly into numpy int8 array using h5py?

Upvotes: 1
