J L

Reputation: 420

In GCP's DocumentAI, when importing documents via API, is it possible to add a Document Type label?

I am creating a Custom Document Classification Processor in GCP's DocumentAI platform, and am trying to understand whether it is possible to assign a Document Type label to documents when importing them to train the Processor.

This StackOverflow answer notes that GCP's DocumentAI platform does expose an API to create a Dataset and upload documents to it. With that in mind, I know that it is possible to use the DocumentAI API to create a dataset, and then (as in the code below) to update that Dataset's schema with document types:

from google.cloud import documentai_v1beta3 as documentai

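document_processor_service_client = documentai.DocumentProcessorServiceClient()
# Dataset operations (schema updates, document import) go through DocumentServiceClient.
document_service_client = documentai.DocumentServiceClient()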

processor_name = 'projects/123456789/locations/us/processors/example123'

processor = document_processor_service_client.get_processor(documentai.GetProcessorRequest(name=processor_name))

dataset_schema = document_service_client.get_dataset_schema(documentai.GetDatasetSchemaRequest(name=f'{processor.name}/dataset/datasetSchema'))
dataset_schema

# Each entity type with base type "document" acts as a Document Type label.
dataset_schema.document_schema.entity_types = [
    {
        "name": "test1",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test1",
    },
    {
        "name": "test2",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test2",
    },
    {
        "name": "test4",
        "base_types": ["document"],
        "entity_type_metadata": {},
        "display_name": "test4",
    },
]

update_schema_request = document_service_client.update_dataset_schema(documentai.UpdateDatasetSchemaRequest(dataset_schema=dataset_schema))

I know that the API also allows importing one or more documents, as in this code:

import_documents_request = document_service_client.import_documents(
    documentai.ImportDocumentsRequest(
        dataset=f"{processor.name}/dataset",
        batch_documents_import_configs=[
            documentai.ImportDocumentsRequest.BatchDocumentsImportConfig(
                auto_split_config=documentai.ImportDocumentsRequest.BatchDocumentsImportConfig.AutoSplitConfig(
                    training_split_ratio=0.7
                ),
                batch_input_config=documentai.BatchDocumentsInputConfig(
                    gcs_documents=documentai.GcsDocuments(
                        documents=[
                            documentai.GcsDocument(
                                gcs_uri="gs://path/to/document.pdf",
                                mime_type="application/pdf",
                            )
                        ]
                    )
                ),
            )
        ],
    ),
)

When manually uploading documents in Cloud Console, there is an option for applying a Document Type label to all imported documents:

Screenshot of "Import documents" interface in Cloud Console

I can't tell from the DocumentAI documentation: Is it possible to similarly assign a Document Type label to one or more Documents via the API? Whether during upload, or after? I have a lot of documents ready to use in a training set, and just need to give each an overall Document Type label (vs. annotating specific fields in each document), so I am looking for a way to do so programmatically, rather than manually.

Upvotes: 2

Views: 559

Answers (3)

A Foster

Reputation: 1

You can also run your documents through an OCR processor, take the Document objects it returns, and edit them as JSON. For example, to label documents for a splitter processor, I edited the entities key in the Document JSON with the following values:

{
    "entities": [
        {
            "confidence": 1,
            "pageAnchor": {
                "pageRefs": [{}]
            },
            "type": "<document_type>"
        },
        {
            "confidence": 1,
            "pageAnchor": {
                "pageRefs": [
                    {"page": "1"},
                    {"page": "2"}
                ]
            },
            "type": "<document_type>"
        },
        {
            "confidence": 1,
            "pageAnchor": {
                "pageRefs": [{}]
            },
            "type": "<document_type>"
        },
        {
            "confidence": 1,
            "pageAnchor": {
                "pageRefs": [
                    {"page": "1"},
                    {"page": "2"}
                ]
            },
            "type": "<document_type>"
        }
    ],
    "pages": [...]
}
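
As a rough sketch of that editing step, the snippet below adds a single document-level entity to a Document JSON that has already been loaded as a Python dict. The field names (entities, type, confidence, pageAnchor, pageRefs) are the ones shown above. The helper name label_document and the sample input are hypothetical, and applying one entity across every page is my assumption for a classifier rather than something confirmed by the documentation. The first page is written as an empty page reference, as in the JSON above.

```python
import json


def label_document(document: dict, document_type: str) -> dict:
    """Return a copy of a Document JSON dict carrying one document-type entity.

    Hypothetical helper: it overwrites any existing "entities" list.
    """
    page_count = len(document.get("pages", []))
    # Page 0 is emitted as {} to mirror the JSON in the answer above;
    # later pages carry their zero-based index as a string.
    page_refs = [{} if i == 0 else {"page": str(i)} for i in range(page_count)]
    labeled = dict(document)  # shallow copy, the input dict is not modified
    labeled["entities"] = [
        {
            "type": document_type,
            "confidence": 1,
            "pageAnchor": {"pageRefs": page_refs},
        }
    ]
    return labeled


# Stand-in for the JSON returned by an OCR processor (made-up content).
ocr_output = {"text": "example", "pages": [{"pageNumber": 1}, {"pageNumber": 2}]}
labeled = label_document(ocr_output, "test1")
print(json.dumps(labeled, indent=2))
```

The labeled JSON would then be saved and imported as a pre-labeled document. Whether the import accepts it as a classification label should be checked against a small test dataset first.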

Upvotes: 0

Holt Skinner

Reputation: 2234

The Document AI API does not currently support applying a label on import when using the importDocuments() method. You need to use the Cloud Console to do bulk labeling.

I would recommend adding more details to the public issue tracker nestor-ceniza-jr@ created so that this can be prioritized by the product development team.

https://issuetracker.google.com/303285767

Upvotes: 2

Nestor

Reputation: 1377

The documentation does not explicitly state that labeling is supported through API requests, but it does list the following options for labeling your documents:

  • Manual: manually label your documents in the Google Cloud console

  • Auto-labeling: use an existing processor version to generate labels

  • Document labeling tasks: lets you outsource document labeling to a team of labeling specialists

  • Import pre-labeled documents: save time if you already have labeled documents

It seems auto-labeling also has to be done through the console UI. If applicable, I would suggest the document labeling tasks option, where you can add instructions and create pools for your labeling specialists.

In the meantime, I have created a feature request for this, so that other users searching for the same feature can find it and help it gain traction (you may add details to the thread too): https://issuetracker.google.com/303285767

Upvotes: 2
