Reputation: 111
This is the first time I am using Amazon Web Services to deploy a machine learning model. I want to deploy my pre-trained TensorFlow model to AWS SageMaker. I am able to deploy the endpoint successfully, but whenever I call the predictor.predict(some_data)
method to invoke the endpoint, it throws an error.
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-2020-04-07-04-25-27-055 in account 453101909370 for more information.
After going through the CloudWatch logs, I found this error:
#011details = "NodeDef mentions attr 'explicit_paddings' not in Op<name=Conv2D; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_HALF, DT_BFLOAT16, DT_FLOAT, DT_DOUBLE]; attr=strides:list(int); attr=use_cudnn_on_gpu:bool,default=true; attr=padding:string,allowed=["SAME", "VALID"]; attr=data_format:string,default="NHWC",allowed=["NHWC", "NCHW"]; attr=dilations:list(int),default=[1, 1, 1, 1]>; NodeDef: {{node conv1_conv/convolution}} = Conv2D[T=DT_FLOAT, _output_shapes=[[?,112,112,64]], data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](conv1_pad/Pad, conv1_conv/kernel/read). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).
I don't know where I went wrong; I have already spent two days on this error and couldn't find any information about it. I have shared the detailed logs here.
The TensorFlow version of my notebook instance is 1.15.
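The "explicit_paddings" message in the log points at a GraphDef/runtime version mismatch: Conv2D only gained the explicit_paddings attribute in newer TensorFlow releases, so a graph exported from a TF 1.15 notebook cannot be parsed by an older serving binary. A minimal sketch of that compatibility rule (the helper below is purely illustrative, not part of TensorFlow or the SageMaker SDK):

```python
# Illustrative only: a rough rule of thumb for GraphDef compatibility.
# A serving runtime generally cannot load a graph produced by a NEWER
# TensorFlow, because new op attributes (like Conv2D's 'explicit_paddings')
# are unknown to the older binary.
def graph_is_loadable(producer_version: str, serving_version: str) -> bool:
    """Return True if the serving TF is at least as new (major.minor)
    as the TF that produced the GraphDef (simplified assumption)."""
    def major_minor(version: str) -> tuple:
        parts = version.split('.')
        return (int(parts[0]), int(parts[1]))
    return major_minor(serving_version) >= major_minor(producer_version)

print(graph_is_loadable('1.15', '1.12'))  # False: serving container too old
print(graph_is_loadable('1.15', '1.15'))  # True: versions match
```

Here the graph comes from a TF 1.15 notebook while the serving container runs an older TF, so the load fails.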
Upvotes: 2
Views: 1015
Reputation: 111
After a lot of searching and trial and error, I was able to solve this problem. In many cases, the problem arises from a mismatch between the TensorFlow and Python versions.
Cause of the problem:
To deploy the endpoint, I was using the TensorFlowModel
class on TF 1.12 with Python 3, which is exactly what caused the problem.
sagemaker_model = TensorFlowModel(model_data=model_data, role=role, framework_version='1.12', entry_point='train.py')
Apparently, TensorFlowModel
only allows Python 2 on TF versions 1.11, 1.12, 2.1.0.
How I fixed it: There are two TensorFlow classes that handle serving in the SageMaker Python SDK. They have different class representations and documentation, as shown here.
Python 3 isn't supported using the TensorFlowModel
object, because the container uses the TensorFlow Serving API library in conjunction with the gRPC client to handle inference; since the TensorFlow Serving API isn't officially supported in Python 3, there are only Python 2 versions of the containers when using the TensorFlowModel
object.
If you need Python 3, then you will need to use the Model
object (the second of the two classes above).
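For reference, in the v1.x SageMaker Python SDK the two classes live at different import paths (worth double-checking against your installed SDK version, since later SDK releases reorganized these modules):

```python
# 1) Legacy TensorFlow container: Python 2 only
from sagemaker.tensorflow.model import TensorFlowModel

# 2) TensorFlow Serving container: supports Python 3
from sagemaker.tensorflow.serving import Model
```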
Finally, I used the Model
class with TensorFlow version 1.15.2.
sagemaker_model = Model(model_data=model_data, role=role, framework_version='1.15.2', entry_point='train.py')
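For completeness, a sketch of the full fix: deploy with the serving Model class and invoke the endpoint. The import path assumes the v1.x SDK; the placeholder S3 URI, role ARN, instance type, and input shape are my assumptions, not from the original post:

```python
import numpy as np
from sagemaker.tensorflow.serving import Model  # v1.x SDK path (assumption)

model_data = 's3://my-bucket/model.tar.gz'             # placeholder URI
role = 'arn:aws:iam::123456789012:role/SageMakerRole'  # placeholder role ARN

# Build the serving model with a TF version matching the one that
# exported the graph, then deploy it to a real-time endpoint.
sagemaker_model = Model(
    model_data=model_data,
    role=role,
    framework_version='1.15.2',
)
predictor = sagemaker_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',  # any supported instance type
)

# Invoke the endpoint; the input shape depends on your model.
result = predictor.predict(np.random.rand(1, 224, 224, 3))
```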
After this change, invoking the endpoint returned successful results.
Upvotes: 4