When training and testing a Document AI project, what influences the F1 score?

Question

Using the Cloud console, I trained a model with only one field (to avoid a UI bug that was stopping training altogether) on one set of data. The model achieved an F1 score of 0.306 with 50 training images and 50 test images.

I then added 150 training images, most of them auto-labelled. The auto-labelling was mostly correct at locating the field, but hit and miss at converting the text accurately.

I deployed the model and it scored 0.17.

I am currently reviewing the auto-labelled fields and confirming or adjusting them (this improved the score to 0.357, so it seems to be the right step). Is it worthwhile to correct the converted text (the OCR output) as well? I understand that the "Human in the Loop" step could provide feedback to the system, but is it the case that these corrected fields are not exported back to the OCR?
I also intend to increase the size of the test set. Is it correct that if I correct the OCR value, the corrected value will be used in the test score? Will it be sent back to the system to improve future text conversion?
Is the size and shape of the identified box part of the F1 score in this product? If so, would using "Select Text" with minor tweaks give the best match to what the AI is already looking for? Many of my early boxes were drawn with "Add Bounding Box" and were sized to fit the space where handwriting is expected (e.g. they include the whitespace around the captured text).
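
For reference, the general definition of the F1 score is the harmonic mean of precision and recall. The sketch below is a minimal Python illustration of that definition only. The function name `f1_score` and the example counts are hypothetical, and it makes no claim about how Document AI decides whether a predicted field counts as a match (by text, by bounding box, or both), which is what the question above asks.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall).

    The counts are assumed to come from comparing predicted fields against
    labelled fields; how a "match" is decided is outside this sketch.
    """
    predicted = true_positives + false_positives  # everything the model predicted
    actual = true_positives + false_negatives     # everything that was labelled
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 1 correct prediction, 1 spurious prediction, 3 missed labels.
example = f1_score(true_positives=1, false_positives=1, false_negatives=3)
print(round(example, 3))
```

Under this definition, a label that is wrong in the test set can count a correct prediction as both a false positive and a false negative, so label quality affects the score directly.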

Thank you
