Ankit A

Reputation: 11

Google Document AI does not return textStyle and font information for any document

I am using Document AI to OCR scanned and machine-generated PDF documents. I have tested 10 different documents, but none of them returned any textStyle properties (the field is always empty).

I want to confirm whether this feature is actually supported and working, or whether it is only mentioned in the documentation.

textStyle information is really critical for our business use-case. An early response would be much appreciated.

I am using Google's default Python example code:

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = 'YOUR_PROJECT_ID'
# location = 'YOUR_PROCESSOR_LOCATION' # Format is 'us' or 'eu'
# processor_id = 'YOUR_PROCESSOR_ID' #  Create processor in Cloud Console
# file_path = '/path/to/local/pdf'
# mime_type = 'application/pdf' # Refer to https://cloud.google.com/document-ai/docs/processors-list for supported file types


def quickstart(
    project_id: str, location: str, processor_id: str, file_path: str, mime_type: str
):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project_id/locations/location/processors/processor_id
    # You must create new processors in the Cloud Console first
    name = client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)

    result = client.process_document(request=request)

    # For a full list of Document object attributes, please reference this page:
    # https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document
    document = result.document

    # Read the text recognition output from the processor
    print("The document contains the following text:")
    print(document.text)

Upvotes: 1

Views: 890

Answers (1)

Holt Skinner

Reputation: 2234

Currently, the textStyles attribute is listed as a "Placeholder" in the documentation, which means it might be populated by processors in the future, or it can be used for end-user data storage.
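To confirm what a given processor returns, a small check along these lines may help. This is only a sketch: `describe_text_styles` is a hypothetical helper, not part of the client library, and the `SimpleNamespace` objects are stand-ins for a real `Document` so the snippet runs without credentials. It assumes the Python client exposes the field as `document.text_styles`, with `font_family` and `font_weight` on each entry.

```python
from types import SimpleNamespace
from typing import List


def describe_text_styles(document) -> List[str]:
    """Return one summary line per entry in document.text_styles.

    An empty list means the processor did not populate the field.
    Hypothetical helper for illustration, not a client-library function.
    """
    lines = []
    # text_styles is a repeated field; treat a missing attribute as empty.
    for style in getattr(document, "text_styles", None) or []:
        family = getattr(style, "font_family", "") or "<unknown>"
        weight = getattr(style, "font_weight", "") or "<unknown>"
        lines.append(f"font_family={family}, font_weight={weight}")
    return lines


# Stand-in documents (assumed shapes), used here instead of a real API response.
empty_doc = SimpleNamespace(text="hello", text_styles=[])
styled_doc = SimpleNamespace(
    text="hello",
    text_styles=[SimpleNamespace(font_family="Arial", font_weight="bold")],
)

print(describe_text_styles(empty_doc))
print(describe_text_styles(styled_doc))
```

With the quickstart code from the question, the same helper could be called as `describe_text_styles(result.document)`; an empty list would match the behaviour described in the question.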

You mention:

textStyle information is really critical for our business use-case.

Could you provide some context on your use case?

Upvotes: 0
