Luis

Reputation: 114

Is there a way to show a PDF in its original structure in the human review custom entity labelling task in AWS SageMaker?

I have modified this sample to read PDFs that contain tables. I would like to keep the tabular structure of the original PDF during the human review process. I notice the custom worker task template uses the crowd-entity-annotation element, which seems to accept only plain text. I am aware that the human review process reads from an S3 key that contains the raw text written by the Textract process.

I have been considering formatting the text with tabulate before writing it to S3, but I don't think that is the best solution. I would like to keep the structure and still be able to annotate custom entities.

Upvotes: 0

Views: 397

Answers (1)

yinxiaoz-amzn

Reputation: 26

Comprehend now natively supports detecting custom-defined entities in PDF documents. To do so, you can try the following steps:

  1. Follow this GitHub README to start the annotation process for PDF documents.
  2. Once the annotations are produced, use the Comprehend CreateEntityRecognizer API to train a custom entity model for semi-structured documents.
  3. Once the entity recognizer is trained, use the StartEntitiesDetectionJob API to run inference on PDF documents.
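
Steps 2 and 3 could be sketched with boto3 roughly as follows. This is a minimal sketch, not a definitive implementation: the bucket paths, role ARN, entity types, label attribute names, and the helper function names are all hypothetical placeholders, and it assumes the annotation job produced an augmented manifest plus annotation files in S3. The two builder functions only assemble the request parameters; the last function is the one that would actually call AWS.

```python
import time
from typing import Any, Dict, List


def build_recognizer_request(name: str, role_arn: str, manifest_s3_uri: str,
                             annotations_s3_uri: str, documents_s3_uri: str,
                             entity_types: List[str],
                             attribute_names: List[str]) -> Dict[str, Any]:
    """Assemble CreateEntityRecognizer parameters for PDF annotations."""
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",  # assumption: English documents
        "InputDataConfig": {
            "DataFormat": "AUGMENTED_MANIFEST",
            "EntityTypes": [{"Type": t} for t in entity_types],
            "AugmentedManifests": [{
                "S3Uri": manifest_s3_uri,
                "Split": "TRAIN",
                "AttributeNames": attribute_names,
                "AnnotationDataS3Uri": annotations_s3_uri,
                "SourceDocumentsS3Uri": documents_s3_uri,
                # Marks the training data as PDF-style documents.
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
            }],
        },
    }


def build_detection_job_request(job_name: str, role_arn: str,
                                recognizer_arn: str, input_s3_uri: str,
                                output_s3_uri: str) -> Dict[str, Any]:
    """Assemble StartEntitiesDetectionJob parameters for a folder of PDFs."""
    return {
        "JobName": job_name,
        "DataAccessRoleArn": role_arn,
        "EntityRecognizerArn": recognizer_arn,
        "LanguageCode": "en",
        "InputDataConfig": {"S3Uri": input_s3_uri,
                            "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def train_and_detect(recognizer_request: Dict[str, Any], job_name: str,
                     input_s3_uri: str, output_s3_uri: str) -> str:
    """Train the recognizer, wait for it, then start the detection job."""
    import boto3  # imported here so the builders work without AWS installed

    client = boto3.client("comprehend")
    arn = client.create_entity_recognizer(**recognizer_request)["EntityRecognizerArn"]

    # Poll until training leaves the in-progress states.
    while True:
        props = client.describe_entity_recognizer(
            EntityRecognizerArn=arn)["EntityRecognizerProperties"]
        if props["Status"] not in ("SUBMITTED", "TRAINING"):
            break
        time.sleep(60)
    if props["Status"] != "TRAINED":
        raise RuntimeError(f"Training ended with status {props['Status']}")

    job_request = build_detection_job_request(
        job_name, recognizer_request["DataAccessRoleArn"], arn,
        input_s3_uri, output_s3_uri)
    return client.start_entities_detection_job(**job_request)["JobId"]
```

Keeping the request construction separate from the AWS calls makes it easy to inspect or log the parameters before anything is submitted. Training can take a while, so in practice the polling loop may be better replaced by a scheduled check.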

Upvotes: 1
