Reputation: 11
Hello and thank you for reading. To put it simply, I want to perform Batch Transform on my XGBoost model that I made using SageMaker Experiments. I trained my model on csv data stored in S3, deployed an endpoint for my model, successfully hit said endpoint with single csv lines and got back expected inferences.
(I followed this tutorial to the letter before starting to work on Batch Transformation)
Now I am attempting to run Batch Transformation using the model created from the above tutorial and I'm running into an error (skip to the bottom to see my error logs). Before I get straight to the error, I want to show my batch transform code.
(imports are done from SageMaker SDK v2.24.4)
import sagemaker
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model
region = boto3.Session().region_name
role = get_execution_role()
image = sagemaker.image_uris.retrieve('xgboost', region, '1.2-1')
model_location = '{mys3info}/output/model.tar.gz'
model = Model(image_uri=image,
model_data=model_location,
role=role,
)
transformer = model.transformer(instance_count=1,
instance_type='ml.m5.xlarge',
strategy='MultiRecord',
assemble_with='Line',
output_path='myOutputPath',
accept='text/csv',
max_concurrent_transforms=1,
max_payload=20)
transformer.transform(data='s3://test-s3-prefix/short_test_data.csv',
content_type='text/csv',
split_type='Line',
join_source='Input'
)
transformer.wait()
short_test_data.csv
33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown
35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown
57,blue-collar,married,primary,no,52,yes,no,unknown,5,may,38,1,-1,0,unknown
32,blue-collar,single,primary,no,23,yes,yes,unknown,5,may,160,1,-1,0,unknown
53,technician,married,secondary,no,-3,no,no,unknown,5,may,1666,1,-1,0,unknown
29,management,single,tertiary,no,0,yes,no,unknown,5,may,363,1,-1,0,unknown
32,management,married,tertiary,no,0,yes,no,unknown,5,may,179,1,-1,0,unknown
38,management,single,tertiary,no,424,yes,no,unknown,5,may,104,1,-1,0,unknown
I made the above csv test data using my original dataset in my command line by running:
head original_training_data.csv > short_test_data.csv
and then I uploaded it to my S3 bucket manually.
Logs
[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=20, BatchStrategy=MULTI_RECORD
[sagemaker logs]: */short_test_data.csv: ClientError: 415
[sagemaker logs]: */short_test_data.csv:
[sagemaker logs]: */short_test_data.csv: Message:
[sagemaker logs]: */short_test_data.csv: Loading csv data failed with Exception, please ensure data is in csv format:
[sagemaker logs]: */short_test_data.csv: <class 'ValueError'>
[sagemaker logs]: */short_test_data.csv: could not convert string to float: 'entrepreneur'
I understand the concept of one-hot encoding and other methods for converting strings to numbers for usage by an algorithm like XGBoost. My problem here is that I was easily able to input the exact same format of data into a deployed endpoint and get results back without doing that level of encoding. I am clearly missing something though, so any help is greatly appreciated!
Upvotes: 0
Views: 1146
Reputation: 484
Your Batch Transform code good and does not throw out any alarms, but looking at error message, it is clearly an input format error. As silly as it may sound. I'd advice you use pandas to save off the test_data from validation set to ensure the formatting is appropriate.
You could do something like this -
data = pd.read_csv("file")
#specify columns to save from ectracted df
data.columns["choose columns"]
# save the data to csv
data.to_csv("data.csv", sep=',', index=False)
Upvotes: 1