Colonder

Reputation: 1576

How to deal with a large number of HDF5 files in Google Cloud Machine Learning?

I have approximately 5k raw data input files and 15k raw data test files, several GB in total. Since these are raw data files, I had to process them iteratively in MATLAB to obtain the features I want to train my actual classifier (a CNN) on. As a result, I produced one HDF5 .mat file for each of the raw data files. I developed my model locally using Keras and a modified DirectoryIterator, in which I had something like:

for i, j in enumerate(batch_index_array):
    arr = np.array(h5py.File(os.path.join(self.directory, self.filenames[j]), "r").get(self.variable))
    # process them further

The file structure is:

|
|--train
|    |--Class1
|    |    |-- 2.5k .mat files
|    |
|    |--Class2
|         |-- 2.5k .mat files
|--eval
|    |--Class1
|    |    |-- 2k .mat files
|    |
|    |--Class2
|         |-- 13k .mat files

This is the file structure I currently have in my Google Cloud Storage bucket. It worked locally in Python with a small model, but now I'd like to use the Google Cloud ML hyperparameter tuning capabilities, since my model is a lot bigger. The problem is that I've read on the Internet that HDF5 files cannot be read directly and easily from Google Cloud Storage. I tried to modify my script like this:

import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='r') as input_f:
        arr = np.array(h5py.File(input_f.read(), "r").get(self.variable))
        # process them further

but it gives me an error similar to `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte`, just with a different hex byte and position 512.
I also tried something like this:

import tensorflow as tf
from tensorflow.python.lib.io import file_io

for i, j in enumerate(batch_index_array):
    with file_io.FileIO(os.path.join(self.directory, self.filenames[j]), mode='rb') as input_f:
        arr = np.fromstring(input_f.read())
        # process them further

but it doesn't work either.

Question
How can I modify my script so that it can read those HDF5 files in Google Cloud ML? I'm aware of the practice of pickling data, but the thing is that loading into memory a pickle created from 15k files (several GB) seems quite inefficient.

Upvotes: 3

Views: 3455

Answers (2)

max9111

Reputation: 6492

Reading data from a temporary file-like object

I do not have direct access to Google ML, so I have to apologize if this answer doesn't work. I did something similar to read h5 files directly from a zipped folder, and I hope it will work here too.

from scipy import io
import numpy as np
from io import BytesIO

# Creating a test file
Array = np.random.rand(10, 10, 10)
d = {"Array": Array}
io.savemat("Test.mat", d)

# Reading the data using an in-memory file-like object
with open('Test.mat', mode='rb') as input_f:
    output = BytesIO()
    num_b = output.write(input_f.read())
    ab = io.loadmat(output)
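Note that `scipy.io.loadmat` only handles .mat files up to format v7.2; the question's files are HDF5 (v7.3), so the same in-memory trick would need h5py instead. A sketch of that, assuming an h5py version (2.9+) that accepts file-like objects; the file and dataset names here are made up, and on Cloud ML the `open()` call would be a `file_io.FileIO(gcs_path, mode='rb')` against the bucket path:

```python
import numpy as np
import h5py
from io import BytesIO

# Create a test HDF5 file containing a dataset called "Array"
arr = np.arange(24.0).reshape(2, 3, 4)
with h5py.File("Test_h5.mat", "w") as f:
    f.create_dataset("Array", data=arr)

# Read the raw bytes into memory -- on Cloud ML this read would come
# from file_io.FileIO(..., mode='rb') instead of open()
with open("Test_h5.mat", "rb") as input_f:
    buf = BytesIO(input_f.read())

# h5py can open the in-memory buffer like a file
with h5py.File(buf, "r") as h5:
    loaded = np.array(h5["Array"])
```

This avoids the `UnicodeDecodeError` from the question, which came from opening a binary file in text mode and passing its contents where h5py expected a file name.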

Upvotes: 0

rhaertel80

Reputation: 8399

HDF is a very common file format that, unfortunately, is not optimal in the cloud. For an explanation of why, please see this blog post.

Given the inherent complexities of HDF in the cloud, I recommend one of the following:

  1. Convert your data to another file format, such as CSV or TFRecords of tf.Example protos
  2. Copy the data locally to /tmp

Conversion can be inconvenient at best, and, for some datasets, some gymnastics may be necessary. A cursory search on the internet revealed multiple tutorials on how to do so. Here's one you might refer to.
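As a sketch of option 1, each processed feature array could be packed into a `tf.train.Example` and written to a TFRecord file, roughly like this (the feature names, shapes, and output path are made up for illustration, and the `tf.io` API shown is the modern spelling; older TF versions used `tf.python_io.TFRecordWriter`):

```python
import numpy as np
import tensorflow as tf

def array_to_example(arr, label):
    """Pack one float feature array, its shape, and an integer
    class label into a tf.train.Example proto."""
    return tf.train.Example(features=tf.train.Features(feature={
        'data': tf.train.Feature(
            float_list=tf.train.FloatList(value=arr.ravel().tolist())),
        'shape': tf.train.Feature(
            int64_list=tf.train.Int64List(value=list(arr.shape))),
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))

# Write a handful of dummy feature arrays to one TFRecord file;
# in practice you would loop over the HDF5 files instead
arrays = [np.random.rand(4, 4).astype(np.float32) for _ in range(3)]
with tf.io.TFRecordWriter('train.tfrecord') as writer:
    for i, a in enumerate(arrays):
        writer.write(array_to_example(a, i % 2).SerializeToString())
```

The resulting files can then live in the GCS bucket and be streamed natively by TensorFlow's input pipeline, with no local copy step.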

Likewise, there are multiple ways to copy data onto the local machine, but beware that your job won't start doing any actual training until the data is copied. Also, should one of the workers die, it will have to re-copy all of the data when it starts up again. If the master dies and you are doing distributed training, this can cause a lot of work to be lost.

That said, if you feel this is a viable approach in your case (e.g., you're not doing distributed training and/or you're willing to wait for the data transfer as described above), just start your Python with something like:

import json
import os
import subprocess

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
if tf_config.get('task', {}).get('type') != 'ps':
  subprocess.check_call(['mkdir', '-p', '/tmp/my_files'])
  subprocess.check_call(['gsutil', '-m', 'cp', '-r',
                         'gs://my/bucket/my_subdir', '/tmp/my_files'])

Upvotes: 4
