Reputation: 5793
I'm trying to use AWS Machine Learning batch processes from a Python project using boto3, and I am getting this failure message in the response:
There was an error trying to parse the schema: 'Can not deserialize instance of boolean out of START_ARRAY token at [Source: java.io.StringReader@60618eb4; line: 1, column: 2] (through reference chain: com.amazon.eml.dp.recordset.SchemaPojo["dataFileContainsHeader"])'
The .csv file itself is fine; I know this because the same file worked when I ran the process through the console.
Here is my code; it is a function on a Django model that holds the URL of the file to be processed (input_file):
def create_data_source_from_s3(self):
    attributes = [
        {"fieldName": "Var1", "fieldType": "CATEGORICAL"},
        {"fieldName": "Var2", "fieldType": "CATEGORICAL"},
        {"fieldName": "Var3", "fieldType": "NUMERIC"},
        {"fieldName": "Var4", "fieldType": "CATEGORICAL"},
        {"fieldName": "Var5", "fieldType": "CATEGORICAL"},
        {"fieldName": "Var6", "fieldType": "CATEGORICAL"},
    ]

    dataSchema = {}
    dataSchema["version"] = "1.0"
    dataSchema["dataFormat"] = "CSV"
    dataSchema["attributes"] = attributes
    dataSchema["targetFieldName"] = "Var6"
    dataSchema["dataFileContainsHeader"] = True,
    json_data = json.dumps(dataSchema)

    client = boto3.client(
        'machinelearning',
        region_name=settings.region,
        aws_access_key_id=settings.aws_access_key_id,
        aws_secret_access_key=settings.aws_secret_access_key,
    )

    # create a datasource
    return client.create_data_source_from_s3(
        DataSourceId=self.input_file.name,
        DataSourceName=self.input_file.name,
        DataSpec={
            'DataLocationS3': 's3://' + settings.AWS_S3_BUCKET_NAME + '/' + self.input_file.name,
            'DataSchema': json_data,
        },
        ComputeStatistics=True
    )
Any ideas what I'm doing wrong?
Upvotes: 0
Views: 257
Reputation: 10101
Remove the trailing comma in this line:
dataSchema["dataFileContainsHeader"] = True,
The trailing comma makes Python build a one-element tuple, so dataSchema["dataFileContainsHeader"] actually contains (True,), and your serialized output looks like this:
{"dataFileContainsHeader": [true], "attributes": [{"fieldName": "Var1", "fieldType": "CATEGORICAL"}, {"fieldName": "Var2", "fieldType": "CATEGORICAL"}, {"fieldName": "Var3", "fieldType": "NUMERIC"}, {"fieldName": "Var4", "fieldType": "CATEGORICAL"}, {"fieldName": "Var5", "fieldType": "CATEGORICAL"}, {"fieldName": "Var6", "fieldType": "CATEGORICAL"}], "version": "1.0", "dataFormat": "CSV", "targetFieldName": "Var6"}
AWS instead expects something like this:
"dataFileContainsHeader": true
Upvotes: 2