galex

Reputation: 136

Google Cloud Document AI OCR - Different number of words and tokens

I'm using Google Document AI OCR to extract the text from an image, following this guide.

I'm using this image: Test image

This is what I'm doing:

from google.cloud import documentai_v1 as documentai
from google.api_core.client_options import ClientOptions

def process_document(project_id: str, location: str,
                     processor_id: str, file_path: str,
                     mime_type: str) -> documentai.Document:

    documentai_client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    resource_name = documentai_client.processor_path(
        project_id, location, processor_id)

    with open(file_path, "rb") as image:
        image_content = image.read()

        raw_document = documentai.RawDocument(
            content=image_content, mime_type=mime_type)

        request = documentai.ProcessRequest(
            name=resource_name, raw_document=raw_document)

        result = documentai_client.process_document(request=request)

        return result.document


def main():
    project_id = 'abc'
    location = 'eu'
    processor_id = 'abc'

    file_path = 'orig.png'
    mime_type = 'image/png'

    document = process_document(project_id=project_id, location=location,
                                processor_id=processor_id, file_path=file_path,
                                mime_type=mime_type)

    print("Tokens:", len(document.pages[0].tokens))
    print("Words:", len(document.text.split()))
    print("Words:", document.text.split())

But the result is not what I'm expecting:

Tokens: 10
Words: 7
Words: ['Hello', 'World.', 'Using', "Tesseract's", 'OCR.', 'From', 'srcmake.']

So, basically, I have more tokens than words. Looking at the results, I can see that commas and periods are counted as separate tokens. But is there a general way of treating document.text to get the same number of words and tokens?
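For illustration, a punctuation-aware split reproduces the 10-vs-7 gap seen in the output above. This is only an approximation of the tokenizer's behavior (Document AI's actual rules are not documented here and may differ in other cases), but it shows why the counts diverge: trailing periods become their own pieces while whitespace splitting keeps them attached.

```python
import re

# document.text from the example output above
text = "Hello World. Using Tesseract's OCR. From srcmake."

# Whitespace split: punctuation stays glued to words -> 7 "words"
words = text.split()
print(len(words), words)

# Punctuation-aware split (an approximation, not Document AI's real
# tokenizer): runs of word characters, keeping internal apostrophes,
# plus each punctuation mark as its own piece -> 10 pieces
pieces = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)
print(len(pieces), pieces)
```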

Upvotes: 0

Views: 119

Answers (1)

Holt Skinner

Reputation: 2234

Tokens have a specific meaning in Document Understanding/Optical Character Recognition, and they may not always line up directly with what humans perceive as words.

If you want to split the text into words, what you are doing here works; the output structure for Token is fixed by the API and can't be changed.

    print("Words:", len(document.text.split()))
    print("Words:", document.text.split())
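Going the other direction, each Token carries a layout.text_anchor whose text_segments index into document.text, so you can recover one string per token instead of forcing the word count to match. A sketch of the mechanics (the field names follow the Document AI Document schema; the SimpleNamespace stand-in at the bottom is only a mock to show the shape, not a real API response):

```python
from types import SimpleNamespace


def tokens_as_strings(document):
    """Return the substring of document.text that each token on page 0 covers.

    Only reads document.text and token.layout.text_anchor.text_segments,
    so it works on any object with that shape. start_index may be unset
    (0) in the real proto, hence the int() cast.
    """
    out = []
    for token in document.pages[0].tokens:
        text = "".join(
            document.text[int(seg.start_index):int(seg.end_index)]
            for seg in token.layout.text_anchor.text_segments
        )
        out.append(text.strip())  # tokens often include trailing whitespace
    return out


# Mock document with the same shape, for demonstration only:
seg = lambda a, b: SimpleNamespace(start_index=a, end_index=b)
tok = lambda *segs: SimpleNamespace(
    layout=SimpleNamespace(text_anchor=SimpleNamespace(text_segments=list(segs))))
doc = SimpleNamespace(
    text="Hello World.",
    pages=[SimpleNamespace(tokens=[tok(seg(0, 6)), tok(seg(6, 11)), tok(seg(11, 12))])])

print(tokens_as_strings(doc))  # ['Hello', 'World', '.']
```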

Upvotes: 0
