Reputation: 1382
I am training a model using Google's Document AI. The training fails with the following error (I have included only a part of the JSON file for simplicity but the error is identical for all documents in my dataset):
"trainingDatasetValidation": {
  "documentErrors": [
    {
      "code": 3,
      "message": "Invalid document.",
      "details": [
        {
          "@type": "type.googleapis.com/google.rpc.ErrorInfo",
          "reason": "INVALID_DOCUMENT",
          "domain": "documentai.googleapis.com",
          "metadata": {
            "num_fields": "0",
            "num_fields_needed": "1",
            "document": "5e88c5e4cc05ddb8.json",
            "annotation_name": "INCOME_ADJUSTMENTS",
            "field_name": "entities.text_anchor.text_segments"
          }
        }
      ]
    }
What I understand from this error is that the model expects the field INCOME_ADJUSTMENTS to appear (at least) once in the document but instead, it finds zero instances of it.
That would have been understandable except I have already defined the field INCOME_ADJUSTMENTS in my schema as "Optional Once", i.e., this field can appear either zero or one time.
Am I missing something? Why does this error persist despite the fact that it is addressed in the schema?
P.S. I have also tried "Optional multiple" (and "Required once" and "Required multiple") and the error persists.
EDIT: As requested, here's what one of the JSON files looks like. Note that there is no PII here as the details (name, SSN, etc.) are synthetic data.
Upvotes: 6
Views: 1506
Reputation: 1617
This bug may have been fixed. I am now seeing an error message "Cannot create labels with empty values" on images that previously generated the error.
Upvotes: 0
Reputation: 21
I had this problem with "internal error" when I had bounding boxes that intersected each other. Check your labels and remove any whose boxes cross each other. Unfortunately, the error does not give any hint as to which document has the problem, so you might have to scroll through them all.
I also had some bounding boxes on empty fields. I do not know whether these contributed to the error, but I removed them along with the intersecting boxes.
After this, I could run the training process without errors.
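Scrolling through every document by hand gets tedious, so here is a minimal Python sketch of the same check, run against locally downloaded copies of the labeled document JSON files. It assumes each file follows the Document AI Document layout (an entities list whose items carry textAnchor.textSegments and pageAnchor.pageRefs[].boundingPoly.normalizedVertices) and accepts both camelCase and snake_case keys. The helper names (check_document, entity_boxes, boxes_overlap) are made up for this example, and only top-level entities are inspected. Whether overlapping boxes actually cause the training failure is my guess from experience, not something the service confirms.

```python
import json
import sys
from itertools import combinations


def _get(obj, camel, snake, default=None):
    """Read a key that may be spelled in camelCase or snake_case."""
    return obj.get(camel, obj.get(snake, default))


def entity_boxes(entity):
    """Yield (page, x_min, y_min, x_max, y_max) for each page ref of an entity."""
    page_anchor = _get(entity, "pageAnchor", "page_anchor", {}) or {}
    for ref in _get(page_anchor, "pageRefs", "page_refs", []) or []:
        poly = _get(ref, "boundingPoly", "bounding_poly", {}) or {}
        verts = _get(poly, "normalizedVertices", "normalized_vertices", []) or []
        if not verts:
            continue
        xs = [v.get("x", 0.0) for v in verts]
        ys = [v.get("y", 0.0) for v in verts]
        # "page" may be omitted (page 0) or serialized as a string.
        yield (int(ref.get("page", 0)), min(xs), min(ys), max(xs), max(ys))


def boxes_overlap(a, b):
    """True if two boxes on the same page share a region of positive area."""
    return (a[0] == b[0] and a[1] < b[3] and b[1] < a[3]
            and a[2] < b[4] and b[2] < a[4])


def check_document(doc):
    """Return a list of human-readable problems found in one document dict."""
    problems = []
    entities = doc.get("entities", [])
    # Labels drawn on empty fields end up with no text segments.
    for ent in entities:
        anchor = _get(ent, "textAnchor", "text_anchor", {}) or {}
        segments = _get(anchor, "textSegments", "text_segments", []) or []
        if not segments:
            problems.append(f"{ent.get('type')}: no text segments")
    # Compare the axis-aligned extents of every pair of label boxes.
    boxed = [(ent.get("type"), box) for ent in entities for box in entity_boxes(ent)]
    for (t1, b1), (t2, b2) in combinations(boxed, 2):
        if boxes_overlap(b1, b2):
            problems.append(f"{t1} overlaps {t2} on page {b1[0]}")
    return problems


if __name__ == "__main__":
    # Usage: python check_labels.py doc1.json doc2.json ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            for problem in check_document(json.load(fh)):
                print(f"{path}: {problem}")
```

The overlap test uses the bounding rectangle of each polygon, so rotated or skewed boxes may be flagged even when the polygons themselves do not touch. Treat the output as a shortlist to review in the labeling UI, not as a definitive list of bad documents.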
Upvotes: 0
Reputation: 81
I have had the same issue in the past and I am running into it again right now.
What I managed to do was take the document string from the error message and search for the matching image in the Cloud Storage bucket that holds the dataset.
Then I opened the image and looked for it in my dataset of 1000+ images.
Then I deleted the bounding box for the label with the issue and relabeled it. This seemed to solve 90% of the issues I had.
It's a ton of manual work, and I wish Google had put more thought into the web app for Doc AI when they released it, because the ML part is great but the app is really lackluster.
I would also be very happy for any other fixes
EDIT: Another, quicker workaround I have found is to delete the latest revision of the labeled document from the dataset in Cloud Storage: take the faulty document name from the operation JSON dump, search for it in documents/, and delete the latest revision.
This will probably mess up the labeling and lose you some work, but it's a quick fix if you just want to make some progress.
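Pulling the faulty document names out of the operation JSON dump can be scripted. The sketch below assumes you saved the dump to a local file and that every error looks like the one in the question, with the document name, annotation_name and field_name inside details[].metadata. It searches for documentErrors at any depth, because I am not certain where that block sits in the full operation response. The function name faulty_documents is made up for this example.

```python
import json
import sys
from collections import defaultdict


def faulty_documents(operation):
    """Map each document name to the set of (annotation, field) pairs it failed on."""
    found = defaultdict(set)

    def walk(node):
        if isinstance(node, dict):
            # Collect every error listed under a "documentErrors" key at this level.
            for error in node.get("documentErrors", []):
                for detail in error.get("details", []):
                    meta = detail.get("metadata", {})
                    if "document" in meta:
                        found[meta["document"]].add(
                            (meta.get("annotation_name"), meta.get("field_name")))
            # Keep descending, since the block may be nested deeper.
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(operation)
    return dict(found)


if __name__ == "__main__":
    # Usage: python faulty_docs.py <path to the saved operation JSON dump>
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as fh:
            report = faulty_documents(json.load(fh))
        for name, issues in sorted(report.items()):
            for annotation, field in sorted(issues, key=str):
                print(f"{name}\t{annotation}\t{field}")
```

This prints one line per document and label, which gives you a list to search for in the bucket or in the labeling UI instead of reading the dump by hand.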
Upvotes: 3
Reputation: 15
I had the same problem, so I deleted my entire dataset, then imported and relabeled everything again. After that, the training worked fine.
Upvotes: -2