Reputation: 14614
I have used AWS Comprehend to train an NLP model. The prediction on the test set runs successfully, but the output file has more rows than the input:
input: 1000 rows
output: 2082 rows
Output looks like this:
predictions.json <...>
{"File": "test.csv", "Line": "0", "Classes": [{"Name": "No", "Score": 0.7022}, {"Name": "Yes", "Score": 0.2892}, {"Name": "tag", "Score": 0.0086}]}
{"File": "test.csv", "Line": "1", "Classes": [{"Name": "No", "Score": 0.6252}, {"Name": "Yes", "Score": 0.3747}, {"Name": "tag", "Score": 0.0001}]}
{"File": "test.csv", "Line": "2", "Classes": [{"Name": "No", "Score": 0.9295}, {"Name": "Yes", "Score": 0.0705}, {"Name": "tag", "Score": 0.0}]}
{"File": "test.csv", "Line": "3", "Classes": [{"Name": "No", "Score": 0.5247}, {"Name": "Yes", "Score": 0.4753}, {"Name": "tag", "Score": 0.0}]}
...
{"File": "test.csv", "Line": "2080", "Classes": [{"Name": "No", "Score": 0.8528}, {"Name": "Yes", "Score": 0.1471}, {"Name": "tag", "Score": 0.0001}]}
{"File": "test.csv", "Line": "2081", "Classes": [{"Name": "No", "Score": 0.5318}, {"Name": "Yes", "Score": 0.4682}, {"Name": "tag", "Score": 0.0}]}
Can anyone help me understand how to use the output?
Upvotes: 0
Views: 613
Reputation: 1
In my case, besides the UTF-8 encoding, the cause was also the presence of carriage returns (\r) in the text.
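A minimal sketch of how the carriage returns could be removed before submitting the file, assuming the documents are already loaded as Python strings. The helper names (flatten_line_breaks, write_one_doc_per_line) are made up for illustration and are not part of any AWS SDK:

```python
import re


def flatten_line_breaks(text):
    """Replace every run of \r and/or \n inside one document with a single space."""
    return re.sub(r"[\r\n]+", " ", text).strip()


def write_one_doc_per_line(docs, path):
    """Write each cleaned document on its own line, UTF-8 encoded, with \n line endings."""
    # newline="\n" stops Python from translating "\n" into "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for doc in docs:
            f.write(flatten_line_breaks(doc) + "\n")
```

After this step the number of lines in the file should match the number of documents, so the "Line" values in the output can be matched back to the input rows.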
Upvotes: 0
Reputation: 859
I faced the same issue. In my case the cause was that the prediction file (test.csv in your case) was not in the required encoding. AWS Comprehend requires UTF-8 encoding.
AWS Docs Link
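As a rough sketch, the file can be checked and re-encoded with plain Python. The source encoding "cp1252" below is only an assumption (a common default for files exported on Windows); replace it with whatever encoding your file actually uses. The function names are hypothetical:

```python
def is_valid_utf8(path):
    """Return True if the raw bytes of the file decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reencode_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Read the file in its original encoding and write it back as UTF-8."""
    # newline="" keeps the existing line endings untouched in both directions.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Running is_valid_utf8 on the test file first shows whether re-encoding is needed at all.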
Upvotes: 2
Reputation: 320
One option is to put each sentence in a separate file and use the whole folder as the test set, setting the option:
"InputFormat": "ONE_DOC_PER_FILE"
Another option is to count how many newline characters ('\n') there are in the dataset; extra newlines inside the text could be the cause of the error.
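Both ideas can be sketched in a few lines of Python. This assumes the test set is a CSV whose text fields may contain quoted line breaks; the function names and the doc_00000.txt file naming are hypothetical:

```python
import csv
import io
from pathlib import Path


def count_physical_lines(raw):
    """Count lines the way a line-based reader would see them."""
    # str.splitlines treats \n, \r\n, a lone \r and a few other
    # separator characters as line breaks.
    return len(raw.splitlines())


def count_csv_records(raw):
    """Count logical CSV records; quoted fields may span several lines."""
    return sum(1 for _ in csv.reader(io.StringIO(raw, newline="")))


def write_one_doc_per_file(docs, out_dir):
    """Write each document to its own UTF-8 file for use with ONE_DOC_PER_FILE."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, doc in enumerate(docs):
        (out / f"doc_{i:05d}.txt").write_text(doc, encoding="utf-8")
```

If count_physical_lines returns more than count_csv_records for the test file, some documents contain embedded line breaks, which would explain an output with more rows than the input.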
Upvotes: 0