Reputation: 1820
I am following the gist of this tutorial:
in which I use a custom sklearn transformer to pre-process data before passing it to XGBoost. When I get to this point:
transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    assemble_with='Line',
    accept='text/csv')
# Preprocess training input
transformer.transform('s3://{}/{}'.format(input_bucket, input_key), content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path
The training data is in S3 and is split across multiple files. I get an error that the maximum payload has been exceeded, and it appears you can only set it up to 100 MB. Does this mean that SageMaker cannot transform larger data as input into another process?
Upvotes: 2
Views: 2134
Reputation: 106
In SageMaker batch transform, MaxPayloadInMB * MaxConcurrentTransforms cannot exceed 100 MB. However, a payload is the data portion of a single request sent to your model. In your case, since the input is CSV, you can set split_type to 'Line' and each CSV line will be treated as a record.
If the batch_strategy is "MultiRecord" (the default value), each payload will contain as many records/lines as fit within the limit.
If the batch_strategy is "SingleRecord", each payload will contain a single CSV line, and you need to ensure that no single line is larger than MaxPayloadInMB.
In short, as long as split_type is specified (not 'None'), MaxPayloadInMB has nothing to do with the total size of your input file.
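As a sketch, reusing the snippet from your question (the strategy, max_payload, and split_type arguments of the Python SDK correspond to BatchStrategy, MaxPayloadInMB, and SplitType in the CreateTransformJob API; the bucket/key variables are the ones from your code):

# Sketch based on the question's snippet; values here are illustrative.
transformer = sklearn_preprocessor.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    strategy='MultiRecord',   # pack as many CSV lines per request as fit
    max_payload=6,            # per-request limit in MB, not the total input size
    assemble_with='Line',
    accept='text/csv')

# split_type='Line' makes each CSV line a record, so the payload limit
# applies to each request rather than to the whole S3 input.
transformer.transform(
    's3://{}/{}'.format(input_bucket, input_key),
    content_type='text/csv',
    split_type='Line')
transformer.wait()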
Hope this helps!
Upvotes: 1