Reputation: 954

AWS Sagemaker BlazingText Multiple Training Files

I'm trying to find out whether Amazon SageMaker BlazingText can train on a dataset that is split across multiple files.

I am trying to use it in Text Classification mode.

It appears that it's not possible, certainly not in File mode, but I'm wondering whether Pipe mode supports it. I don't want to keep all my training data in one file: if it's generated by an EMR cluster, I would need to combine the output files afterwards, which is clunky.

Thanks!

Upvotes: 3

Answers (1)

julitopower

Reputation: 181

You are right in that File mode doesn't support multiple files (https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

Pipe mode would in theory work but there are a few caveats:

The expected format is Augmented Manifest (https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html). This is essentially JSON Lines, for instance:

{"source":"linux ready for prime time ", "label":1}
{"source":"bowled by the slower one ", "label":2}

and then you have to pass the AttributeNames argument to the CreateTrainingJob SageMaker API (it is all explained in the link above).

With Augmented Manifest, currently only one label is supported.
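As a rough illustration of where AttributeNames goes, here is a minimal sketch of one InputDataConfig entry for CreateTrainingJob. The helper name `build_train_channel` and the S3 URI are made up for the example, and a real request will likely need additional fields described in the Augmented Manifest documentation linked above, so treat this as a starting point rather than a complete configuration.

```python
def build_train_channel(manifest_s3_uri, attribute_names=("source", "label")):
    """Build one InputDataConfig entry for an augmented-manifest channel.

    Hypothetical helper: it only assembles a plain dict and calls no AWS API.
    """
    return {
        "ChannelName": "train",
        # Augmented manifests are streamed to the algorithm, hence Pipe mode.
        "InputMode": "Pipe",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": manifest_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
                # The JSON keys to read from every line of the manifest.
                "AttributeNames": list(attribute_names),
            }
        },
    }


# The bucket and key below are placeholders, not real resources.
channel = build_train_channel("s3://example-bucket/blazingtext/train.manifest")
```

The resulting dict would be passed as one element of the `InputDataConfig` list in the training job request.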

In order to use Pipe mode, you would need to modify your EMR job to generate Augmented Manifest format, and you could only use one label per sentence.
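To give an idea of what that change could look like, here is a small sketch that turns one line of the usual BlazingText supervised text format (a `__label__` prefix followed by the sentence) into one Augmented Manifest line. It assumes your EMR output currently uses that text format; the function name `to_manifest_line` is hypothetical, and lines with more than one label are rejected because of the single-label limitation mentioned above.

```python
import json

LABEL_PREFIX = "__label__"


def to_manifest_line(line):
    """Convert '__label__X some text' into one JSON Lines record."""
    tokens = line.split()
    labels = [t[len(LABEL_PREFIX):] for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    # Augmented Manifest input supports a single label per sentence.
    if len(labels) != 1:
        raise ValueError(f"expected exactly one label, found {len(labels)}")
    # Keep numeric labels as integers, as in the example records above.
    label = int(labels[0]) if labels[0].isdigit() else labels[0]
    return json.dumps({"source": " ".join(words), "label": label})


print(to_manifest_line("__label__1 linux ready for prime time"))
# prints {"source": "linux ready for prime time", "label": 1}
```

In a Spark or Hadoop job, the same function could be applied per record before writing the output, so every part file is already valid JSON Lines.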

At this stage, concatenating the files generated by your EMR job into a single file seems like the best option.
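If you go the concatenation route, a minimal sketch could look like the following. It assumes the part files have already been copied to a local directory, that they follow the common `part-*` naming, and that each one ends with a newline; `concatenate_parts` is a hypothetical helper, not something SageMaker or EMR provides.

```python
import shutil
from pathlib import Path


def concatenate_parts(part_dir, output_path, pattern="part-*"):
    """Append every matching part file, in sorted order, to output_path.

    Returns the number of part files that were merged. The output file
    should live outside part_dir, or at least not match the pattern.
    """
    parts = sorted(Path(part_dir).glob(pattern))
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the bytes so large files are never held in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

The merged file can then be uploaded to S3 and used as the single training file in File mode.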

Upvotes: 1
