Reputation: 2904
I have a model that was trained on a Machine Learning Compute on Azure Machine Learning Service. The registered model already lives my workspace and I would like to deploy it to a pre-existing AKS instance that I previously provisioned in my workspace. I am able to successfully configure and register the container image:
# retrieve cloud representations of the models
rf = Model(workspace=ws, name='pumps_rf')
le = Model(workspace=ws, name='pumps_le')
ohc = Model(workspace=ws, name='pumps_ohc')
print(rf); print(le); print(ohc)
<azureml.core.model.Model object at 0x7f66ab3b1f98>
<azureml.core.model.Model object at 0x7f66ab7e49b0>
<azureml.core.model.Model object at 0x7f66ab85e710>
package_list = [
'category-encoders==1.3.0',
'numpy==1.15.0',
'pandas==0.24.1',
'scikit-learn==0.20.2']
# Conda environment configuration
myenv = CondaDependencies.create(pip_packages=package_list)
conda_yml = 'file:'+os.getcwd()+'/myenv.yml'
with open(conda_yml,"w") as f:
f.write(myenv.serialize_to_string())
Configuring and registering the image works:
# Image configuration
image_config = ContainerImage.image_configuration(execution_script='score.py',
runtime='python',
conda_file='myenv.yml',
description='Pumps Random Forest model')
# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name = Config.IMAGE_NAME,
models = [rf, le, ohc],
image_config = image_config,
workspace = ws)
Creating image
Running....................
SucceededImage creation operation finished for image pumpsrfimage:2, operation "Succeeded"
Attaching to an existing cluster also works:
# Attach the cluster to your workgroup
attach_config = AksCompute.attach_configuration(resource_group = Config.RESOURCE_GROUP,
cluster_name = Config.DEPLOY_COMPUTE)
aks_target = ComputeTarget.attach(workspace=ws,
name=Config.DEPLOY_COMPUTE,
attach_configuration=attach_config)
# Wait for the operation to complete
aks_target.wait_for_completion(True)
SucceededProvisioning operation finished, operation "Succeeded"
However, when I try to deploy the image to the existing cluster, it fails with a WebserviceException
.
# Set configuration and service name
aks_config = AksWebservice.deploy_configuration()
# Deploy from image
service = Webservice.deploy_from_image(workspace = ws,
name = 'pumps-aks-service-1' ,
image = image,
deployment_config = aks_config,
deployment_target = aks_target)
# Wait for the deployment to complete
service.wait_for_deployment(show_output = True)
print(service.state)
WebserviceException: Unable to create service with image pumpsrfimage:1 in non "Succeeded" creation state.
---------------------------------------------------------------------------
WebserviceException Traceback (most recent call last)
<command-201219424688503> in <module>()
7 image = image,
8 deployment_config = aks_config,
----> 9 deployment_target = aks_target)
10 # Wait for the deployment to complete
11 service.wait_for_deployment(show_output = True)
/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/webservice.py in deploy_from_image(workspace, name, image, deployment_config, deployment_target)
284 return child._deploy(workspace, name, image, deployment_config, deployment_target)
285
--> 286 return deployment_config._webservice_type._deploy(workspace, name, image, deployment_config, deployment_target)
287
288 @staticmethod
/databricks/python/lib/python3.5/site-packages/azureml/core/webservice/aks.py in _deploy(workspace, name, image, deployment_config, deployment_target)
Any ideas on how to solve this issue? I am writing the code in a Databricks notebook. Also, I am able to create and deploy the cluster using Azure Portal no problem so this appears to be an issue with my code/Python SDK or the way Databricks works with AMLS.
UPDATE: I was able to deploy my image to AKS using Azure Portal and the webservice works as expected. This means the issue lies somewhere between Databricks, the Azureml Python SDK and Machine Learning Service.
UPDATE 2: I'm working with Microsoft to fix this issue. Will report back once we have a solution.
Upvotes: 2
Views: 1292
Reputation: 2904
In my initial code, when creating the image, I was not using:
image.wait_for_creation(show_output=True)
As a consequence, I was calling CreateImage
and DeployImage
before the image was created which errored out. Can't believe it was that simple..
UPDATED IMAGE CREATION SNIPPET:
# Register the image from the image configuration
# to Azure Container Registry
image = ContainerImage.create(name = Config.IMAGE_NAME,
models = [rf, le, ohc],
image_config = image_config,
workspace = ws)
image.wait_for_creation(show_output=True)
Upvotes: 2
Reputation: 461
Agreed on Arvid's answer. Were you able to succesfully run it? You can also try and deploy it to ACI, but if the problem is in the score.py, you'd have the same issue but it's quick to try. Also, a bit more painful but if you want to debug the deployment, but you can expose port tcp 5678 on your local docker deployment and use VSCode and PTVSD to connect to it and debug step by step.
Upvotes: 0
Reputation: 455
From personal experience I would say that the error message you see might suggest that there is some error with the script inside the image. Such errors doesn't necessary prevent the image from being created successfully, but it might prevent the image from being used in a service. However, if you have successfully been able to deploy the image in other services, then you should be able to rule out this option.
You can follow this guide for more information on how to debug the Docker image locally, as well as finding logs and other useful information.
Upvotes: 1