Reputation: 311
I am looking for a fast way to load my collection of HDF5 files into a numpy array where each row is a flattened version of an image. What I mean exactly:
My HDF5 files store, besides other information, images per frame. Each file holds 51 frames of 512x424 images. Now I have 300+ HDF5 files and I want the image pixels to be stored as one single vector per frame, where all frames of all files are stored in one numpy ndarray. The following picture should help to understand:
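To make the target layout concrete, here is a small sketch (file count and shapes are the ones described above):
n_files, frames_per_file = 300, 51
pixels_per_frame = 512 * 424                     # = 217088
target_shape = (n_files * frames_per_file, pixels_per_frame)
print(target_shape)                              # (15300, 217088) -> one flattened frame per row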
What I have so far is a very slow method, and I have no idea how I can make it faster. As far as I can tell, the problem is that my final array is grown too often: I observe that the first files are loaded into the array very fast, but the speed drops off quickly (observed by printing the number of the current HDF5 file).
My current code:
import os
import glob
import numpy as np
import h5py

os.chdir(os.getcwd() + "\\datasets")
# predefine first row to use vstack later
numpy_data = np.ndarray((1, 217088))
# search for all .hdf5 files
for idx, file in enumerate(glob.glob("*.hdf5")):
    f = h5py.File(file, 'r')
    # load all img data to imgs (=ndarray, but not flattened)
    imgs = f['img']['data'][:]
    # iterate over all frames (51)
    for frame in range(0, imgs.shape[0]):
        print("processing {}/{} (file/frame)".format(idx + 1, frame + 1))
        data = np.array(imgs[frame].flatten())
        numpy_data = np.vstack((numpy_data, data))
        # delete the predefined first row once a real row has been stored
        if idx == 0 and frame == 0:
            numpy_data = np.delete(numpy_data, 0, 0)
    f.close()
For further information: I need this for learning a decision tree. Since my HDF5 data is bigger than my RAM, I think converting it into a numpy array saves memory and is therefore better suited.
Thanks for any input.
Upvotes: 0
Views: 1801
Reputation: 6482
Do you really want to load all images into RAM and not use a single HDF5 file instead? Accessing an HDF5 file can be quite fast if you don't make any mistakes (unnecessary fancy indexing, improper chunk-cache size). If you want the numpy way, this would be a possibility:
import os
import glob
import numpy as np
import h5py

os.chdir(os.getcwd() + "\\datasets")
img_per_file = 51
# get all HDF5 files
files = []
for idx, file in enumerate(glob.glob("*.hdf5")):
    files.append(file)
# allocate memory for your final array (change the datatype if your images have some other type)
numpy_data = np.empty((len(files) * img_per_file, 217088), dtype=np.uint8)
# now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    numpy_data[ii:ii + img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file
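Since the question mentions learning a decision tree, a hedged usage sketch with scikit-learn could look like this (the labels y are purely hypothetical; the question does not say where they come from):
from sklearn.tree import DecisionTreeClassifier

# hypothetical labels: one label per row (= per frame) of numpy_data
y = np.zeros(numpy_data.shape[0], dtype=np.int64)    # placeholder only

clf = DecisionTreeClassifier()
clf.fit(numpy_data, y)                               # each row is one flattened frame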
Writing your data to a single HDF5 file would be quite similar:
f_out = h5py.File(File_Name_HDF5_out, 'w')
# create the dataset (change the datatype if your images have some other type)
dset_out = f_out.create_dataset(Dataset_Name_out, (len(files) * img_per_file, 217088), chunks=(1, 217088), dtype='uint8')
# now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    dset_out[ii:ii + img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file
f_out.close()
If you only want to access whole images afterwards, this chunk size should be okay. If not, you have to change it to fit your needs.
What you should do when accessing an HDF5 file:
Use a chunk size that fits your needs.
Set a proper chunk-cache size. This can be done with the h5py low-level API or with h5py_cache (https://pypi.python.org/pypi/h5py-cache/1.0); see the sketch after the code below.
Avoid any type of fancy indexing. If your dataset has n dimensions, access it in a way that the returned array also has n dimensions.
# Chunk size is (50, 50) and we iterate over the first dimension
numpyArray = h5_dset[i, :]                   # slow
numpyArray = np.squeeze(h5_dset[i:i+1, :])   # does the same but is much faster
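As a hedged sketch of the chunk-cache point above: with a reasonably recent h5py (>= 2.9) the cache can be set directly via keyword arguments when opening the file; with older versions the same is possible through the low-level API or h5py_cache:
# enlarge the raw-data chunk cache when opening the file
# ('Your_Data.hdf5' is a placeholder name; sizes should be tuned to your chunk layout)
f = h5py.File('Your_Data.hdf5', 'r',
              rdcc_nbytes=64 * 1024**2,   # 64 MiB cache instead of the 1 MiB default
              rdcc_nslots=10007)          # number of cache slots, ideally a prime
dset = f['img']['data']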
EDIT: This shows how to read your data into a memory-mapped numpy array. I think your method expects data of type np.float32. https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap
numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+', shape=(len(files) * img_per_file, 217088))
Everything else could be kept the same. If it works, I would also recommend using an SSD instead of a hard disk.
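One small follow-up sketch: the file written by np.memmap is a raw buffer without a header, so dtype and shape have to be passed again when the array is reopened later (e.g. for training):
# reopen the memory-mapped array read-only; dtype and shape must match the writing step
numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='r',
                       shape=(len(files) * img_per_file, 217088))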
Upvotes: 1
Reputation: 231385
I don't think you need to iterate over
imgs = f['img']['data'][:]
and reshape each 2d array. Just reshape the whole thing. If I understand your description right, imgs is a 3d array of shape (51, 512, 424), so
imgs.reshape(51, 512*424)
should be the 2d equivalent.
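A quick sketch with dummy data of the shapes described in the question (the uint8 dtype is just an assumption) to check the equivalence:
import numpy as np

# dummy stand-in for f['img']['data'][:]
imgs = np.random.randint(0, 256, size=(51, 512, 424), dtype=np.uint8)

flat = imgs.reshape(51, 512 * 424)                    # or imgs.reshape(imgs.shape[0], -1)
print(flat.shape)                                     # (51, 217088)
print(np.array_equal(flat[0], imgs[0].flatten()))     # True: same pixels, one row per frame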
If you must loop, don't use vstack (or some variant) to build a bigger array. One, it is slow, and two, it's a pain to clean up the initial 'dummy' entry. Use list appends, and do the stacking once, at the end:
alist = []
for frame....
    alist.append(data)
data_array = np.vstack(alist)
vstack (and family) takes a list of arrays as input, so it can work with many at once. List append is much faster when done iteratively.
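Put together, a minimal sketch of how both points might look applied to the loop from the question (file pattern and dataset path taken from there):
import glob
import numpy as np
import h5py

alist = []
for file in glob.glob("*.hdf5"):
    with h5py.File(file, 'r') as f:
        imgs = f['img']['data'][:]                    # 3d array, e.g. (51, 512, 424)
    alist.append(imgs.reshape(imgs.shape[0], -1))     # flatten each frame to one row
numpy_data = np.vstack(alist)                         # stack once, at the end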
I question whether putting things into one array will help. I don't know exactly how the size of an hdf5 file relates to the size of the loaded array, but I expect they are in the same order of magnitude. So trying to load all 300 files into memory might not work. That's what, 3G of pixels? (300 files x 51 frames x 217088 pixels ≈ 3.3e9 values, i.e. about 3 GB even at one byte per pixel.)
For an individual file, h5py has provisions for loading chunks of an array that is too large to fit in memory. That indicates the problem often goes the other way: the file holds more than fits in memory.
Is it possible to load large data directly into numpy int8 array using h5py?
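As a hedged illustration of that kind of chunked access (the file name is a placeholder and the block size is arbitrary):
import h5py

# process a large dataset in blocks of frames instead of pulling it into RAM at once
with h5py.File("some_file.hdf5", 'r') as f:
    dset = f['img']['data']                       # no [:], the data stays on disk
    for start in range(0, dset.shape[0], 10):
        block = dset[start:start + 10]            # only this slice is read
        rows = block.reshape(block.shape[0], -1)  # flatten each frame in the slice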
Upvotes: 1