Is there a way to show a PDF in its original structure in the human review custom entity labelling in AWS SageMaker?

Question

I have modified this sample to read PDFs in tabular format. I would like to keep the tabular structure of the original PDF during the human review process. I notice the custom worker task template uses the crowd-entity-annotation element, which seems to read only plain text. I am aware that the human review process reads from an S3 key that contains the raw text written by the Textract process.

I have been considering writing to S3 using tabulate but I don't think that is the best solution. I would like to keep the structure and still have the ability to annotate custom entities.

yinxiaoz-amzn · Accepted Answer

Amazon Comprehend now natively supports detecting custom entities in PDF documents. To do so, you can try the following steps:

1. Follow this GitHub README to start the annotation process for PDF documents.
2. Once the annotations are produced, use the Comprehend CreateEntityRecognizer API to train a custom entity model for semi-structured documents.
3. Once the entity recognizer is trained, use the StartEntitiesDetectionJob API to run inference on PDF documents.
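
The training and inference steps above could be sketched with boto3 roughly as follows. This is only an illustration under some assumptions: the annotation step is assumed to have produced an augmented manifest plus annotation files in S3, and every bucket path, ARN, entity type, and the label attribute name shown in the usage note are hypothetical placeholders rather than values from the sample. The helper function names are also made up for this sketch; only `create_entity_recognizer` and `start_entities_detection_job` are Comprehend client calls.

```python
def build_training_request(name, role_arn, manifest_uri, annotations_uri,
                           pdfs_uri, entity_types, label_attribute):
    """Build keyword arguments for Comprehend's CreateEntityRecognizer call."""
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
        "InputDataConfig": {
            # PDF annotations are passed as an augmented manifest
            "DataFormat": "AUGMENTED_MANIFEST",
            "EntityTypes": [{"Type": t} for t in entity_types],
            "AugmentedManifests": [{
                "S3Uri": manifest_uri,
                # Attribute name written by the labeling job (assumed value)
                "AttributeNames": [label_attribute],
                "AnnotationDataS3Uri": annotations_uri,
                "SourceDocumentsS3Uri": pdfs_uri,
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
            }],
        },
    }


def build_detection_request(job_name, role_arn, recognizer_arn,
                            input_uri, output_uri):
    """Build keyword arguments for Comprehend's StartEntitiesDetectionJob call."""
    return {
        "JobName": job_name,
        "DataAccessRoleArn": role_arn,
        "EntityRecognizerArn": recognizer_arn,
        "LanguageCode": "en",
        # Each PDF is treated as one document
        "InputDataConfig": {"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_uri},
    }


def submit_training(request):
    """Start training and return the new recognizer's ARN (needs AWS credentials)."""
    import boto3  # imported here so the builders above work without boto3 installed

    comprehend = boto3.client("comprehend")
    return comprehend.create_entity_recognizer(**request)["EntityRecognizerArn"]


def submit_detection(request):
    """Start an asynchronous detection job and return its job ID."""
    import boto3

    comprehend = boto3.client("comprehend")
    return comprehend.start_entities_detection_job(**request)["JobId"]
```

A call such as `build_training_request("pdf-recognizer", role_arn, "s3://my-bucket/manifests/output.manifest", "s3://my-bucket/annotations/", "s3://my-bucket/pdfs/", ["INVOICE_ID"], "my-labeling-job")` would then be passed to `submit_training`. Training is asynchronous, so the detection job should only be submitted once the recognizer has finished training.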

