Reputation: 954
I'm trying to find out whether Amazon SageMaker BlazingText can use multiple files for its training dataset.
I am trying to use it in Text Classification mode.
It appears that it's not possible, certainly not in File mode, but I'm wondering whether Pipe mode supports it. I don't want to have all my training data in one file, because if it's generated by an EMR cluster I would need to combine the output files afterwards, which is clunky.
Thanks!
Upvotes: 3
Views: 683
Reputation: 181
You are right in that File mode doesn't support multiple files (https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).
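Pipe mode would work in theory, but there are a few caveats. The input has to be in Augmented Manifest format, which is JSON Lines with one sample per line, for instance: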
{"source":"linux ready for prime time ", "label":1}
{"source":"bowled by the slower one ", "label":2}
and then you have to pass the AttributeNames argument to the createTrainingJob SageMaker API (it is all explained in the link above).
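As a rough sketch of where AttributeNames goes, this is roughly what one InputDataConfig entry for a Pipe mode channel might look like. The helper name, the channel name and the S3 URI are made up for illustration, and the RecordWrapperType and distribution settings are my assumption based on the Augmented Manifest docs, so check them against the link above before relying on them.

```python
def augmented_manifest_channel(manifest_s3_uri, channel_name="train"):
    """Build one InputDataConfig entry for a Pipe mode Augmented Manifest channel.

    Hypothetical helper: the returned dict is meant to be placed in the
    InputDataConfig list of a CreateTrainingJob request.
    """
    return {
        "ChannelName": channel_name,
        "InputMode": "Pipe",
        # Assumption: Augmented Manifest input is wrapped as RecordIO.
        "RecordWrapperType": "RecordIO",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": manifest_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
                # Must match the keys used in each JSON line of the manifest.
                "AttributeNames": ["source", "label"],
            }
        },
    }


# Example with a placeholder bucket and key.
channel = augmented_manifest_channel("s3://example-bucket/train.manifest")
```

The point is only that the keys in the JSON lines ("source" and "label") have to be repeated in AttributeNames.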
In order to use Pipe mode, you would need to modify your EMR job to generate Augmented Manifest format, and you could only use one label per sentence.
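If you did go that way, the conversion itself is small. Here is a minimal sketch, assuming your EMR job currently writes the `__label__` prefixed lines that File mode expects; the function name and the label-to-integer mapping are hypothetical (I am using integers only because the examples above do).

```python
import json

LABEL_PREFIX = "__label__"


def to_manifest_line(line, label_ids):
    """Convert one '__label__X some text' line into an Augmented Manifest JSON line.

    label_ids is a caller-supplied dict mapping label names to integers.
    Raises ValueError unless the line carries exactly one label.
    """
    tokens = line.split()
    labels = [t[len(LABEL_PREFIX):] for t in tokens if t.startswith(LABEL_PREFIX)]
    words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
    if len(labels) != 1:
        raise ValueError(f"expected exactly one label, found {len(labels)}")
    return json.dumps({"source": " ".join(words), "label": label_ids[labels[0]]})


# Example with made-up label names.
print(to_manifest_line("__label__tech linux ready for prime time", {"tech": 1, "sports": 2}))
```

Lines with more than one label are rejected rather than silently truncated, since that is exactly the case Pipe mode can't handle.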
At this stage, concatenating the files generated by your EMR job into a single file seems like the best option.
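For what it's worth, the concatenation step doesn't have to be very clunky. A minimal sketch, assuming the part files have already been copied from S3 to a local directory and that each one ends with a newline; the function name and the `part-*` pattern are assumptions about how your job names its output.

```python
import shutil
from pathlib import Path


def concatenate_parts(part_dir, output_path, pattern="part-*"):
    """Append every file matching pattern in part_dir to output_path, in name order.

    Returns the number of part files that were concatenated.
    """
    parts = sorted(Path(part_dir).glob(pattern))
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the bytes across so large parts are not loaded into memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

You would then upload the single output file to S3 and point the File mode channel at it.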
Upvotes: 1