Raj Kumar N

Reputation: 793

Google Cloud AI Platform error when executing a job

Using the Python googleapiclient library, we create a training job on AI Platform:

import datetime
import logging

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()

training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'packageUris': ['package_bucket_file_path'],
    'pythonModule': 'randomforest_trainer_RUL.train',
    'args': [
        '--trainFilePath', data[0],
        '--trainOutputPath', data[2],
        '--testFilePath', data[1],
        '--testOutputPath', data[3],
        '--target', target_label,
        '--bucket', BUCKET,
        '--expid', experiment_id
    ],
    'region': "region_of_bucket",
    'runtimeVersion': '1.14',
    'pythonVersion': '3.5'
}

timestamp = datetime.datetime.now().strftime('%y%m%d_%H%M%S%f')
job_name = "job_" + experiment_id

## logging information
logging.info("Job Name:{}".format(job_name))
##

api = discovery.build('ml', 'v1', credentials=credentials, cache_discovery=False)

project_id = 'projects/{}'.format(PROJECT)
# The Job resource body: a jobId plus the trainingInput defined above
job_spec = {'jobId': job_name, 'trainingInput': training_inputs}
request = api.projects().jobs().create(body=job_spec, parent=project_id)
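For reference, a minimal sketch of how the request above is executed and how the job can be polled afterwards (api, request, project_id and job_name are the variables from the snippet above; jobs().get is the standard googleapiclient method for reading a job's state, shown here only as an illustration):

# Submit the job; execute() sends the create request to the AI Platform API
response = request.execute()
logging.info("Submitted job, initial state: {}".format(response.get('state')))

# The same client can later poll the job, e.g.:
job_full_name = '{}/jobs/{}'.format(project_id, job_name)
status = api.projects().jobs().get(name=job_full_name).execute()
logging.info("Current state: {}".format(status.get('state')))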

This was working: I was able to train the model and run testing and prediction until yesterday. But all of a sudden I can no longer train the model on AI Platform, and the error I'm getting is:

The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
  [...]
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 810, in ls
    combined_listing = self._ls(path, detail) + self._ls(path + "/", detail)
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-12>", line 2, in _ls
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 820, in _ls
    listing = self._list_objects(path)
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-5>", line 2, in _list_objects
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 616, in _list_objects
    listing = self._do_list_objects(path)
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-6>", line 2, in _do_list_objects
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 637, in _do_list_objects
    maxResults=max_results,
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-2>", line 2, in _call
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 517, in _call
    validate_response(r, path)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 171, in validate_response
    raise IOError("Forbidden: %s\n%s" % (path, msg))
OSError: Forbidden: https://www.googleapis.com/storage/v1/b/some-storage-bucket/o/
[email protected] does not have serviceusage.services.use access to project 34XX12XX12X.

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=87XX90XX1XX&resource=ml_job%2Fjob_id%2Fjob_5de3592da3c3c541d73389er&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22job_5de3592da3c3c541d73389erce%22

The key part of the error is:

[email protected] 
    does not have serviceusage.services.use access to project 34XX12XX12X

Upvotes: 2

Views: 437

Answers (1)

Madhi

Reputation: 1236

I had the exact same problem today. As Nick said, it's caused by the new GCSFS release. Instead of using pd.read_csv(gcs_path), I suggest reading the CSV file from the bucket directly with TensorFlow's GFile function:

import pandas as pd
import tensorflow as tf

def read_csv_from_gcs(gcs_path, opts=None):
    # tf.gfile.GFile opens gs:// paths directly, bypassing gcsfs
    with tf.gfile.GFile(gcs_path) as f:
        if opts:
            df = pd.read_csv(f, opts)
        else:
            df = pd.read_csv(f)
    return df
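For example (the bucket path here is a made-up placeholder, and read_csv_from_gcs is just the helper wrapped around the snippet above):

df = read_csv_from_gcs('gs://some-storage-bucket/data/train.csv')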

This will allow the job to run without breaking.

Upvotes: 3
