galex

Reputation: 136

Google Cloud Document AI OCR - Different number of words and tokens

I'm using Google Document AI OCR to extract the text from an image, following this guide.

I'm using this image: Test image

This is what I'm doing:

from google.cloud import documentai_v1 as documentai
from google.api_core.client_options import ClientOptions

def process_document(project_id: str, location: str,
                     processor_id: str, file_path: str,
                     mime_type: str) -> documentai.Document:

    documentai_client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    resource_name = documentai_client.processor_path(
        project_id, location, processor_id)

    with open(file_path, "rb") as image:
        image_content = image.read()

        raw_document = documentai.RawDocument(
            content=image_content, mime_type=mime_type)

        request = documentai.ProcessRequest(
            name=resource_name, raw_document=raw_document)

        result = documentai_client.process_document(request=request)

        return result.document


def main():
    project_id = 'abc'
    location = 'eu'
    processor_id = 'abc'

    file_path = 'orig.png'
    mime_type = 'image/png'

    document = process_document(project_id=project_id, location=location,
                                processor_id=processor_id, file_path=file_path,
                                mime_type=mime_type)

    print("Tokens:", len(document.pages[0].tokens))
    print("Words:", len(document.text.split()))
    print("Words:", document.text.split())

But the result is not what I'm expecting:

Tokens: 10
Words: 7
Words: ['Hello', 'World.', 'Using', "Tesseract's", 'OCR.', 'From', 'srcmake.']

So, basically, I have more tokens than words. Looking at the results, I can see that commas and periods are counted as separate tokens. But is there a general way of treating document.text to get the same number of words and tokens?
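For illustration, a punctuation-aware split reproduces the 10-vs-7 gap seen in the output above. This is only an approximation of the tokenizer's behavior (Document AI's actual rules are not documented here and may differ in other cases), but it shows why the counts diverge: trailing periods become their own pieces while whitespace splitting keeps them attached.

```python
import re

# document.text from the example output above
text = "Hello World. Using Tesseract's OCR. From srcmake."

# Whitespace split: punctuation stays glued to words -> 7 "words"
words = text.split()
print(len(words), words)

# Punctuation-aware split (an approximation, not Document AI's real
# tokenizer): runs of word characters, keeping internal apostrophes,
# plus each punctuation mark as its own piece -> 10 pieces
pieces = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)
print(len(pieces), pieces)
```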

Upvotes: 0

Views: 119

Answers (1)

Holt Skinner

Reputation: 2234

Tokens have a specific meaning in Document Understanding/Optical Character Recognition, and they may not always line up directly with what humans perceive as words.

If you want to split the text into words, what you are doing here works; the output structure for Token is fixed by the API and can't be changed.

    print("Words:", len(document.text.split()))
    print("Words:", document.text.split())
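Going the other direction, each Token carries a layout.text_anchor whose text_segments index into document.text, so you can recover one string per token instead of forcing the word count to match. A sketch of the mechanics (the field names follow the Document AI Document schema; the SimpleNamespace stand-in at the bottom is only a mock to show the shape, not a real API response):

```python
from types import SimpleNamespace


def tokens_as_strings(document):
    """Return the substring of document.text that each token on page 0 covers.

    Only reads document.text and token.layout.text_anchor.text_segments,
    so it works on any object with that shape. start_index may be unset
    (0) in the real proto, hence the int() cast.
    """
    out = []
    for token in document.pages[0].tokens:
        text = "".join(
            document.text[int(seg.start_index):int(seg.end_index)]
            for seg in token.layout.text_anchor.text_segments
        )
        out.append(text.strip())  # tokens often include trailing whitespace
    return out


# Mock document with the same shape, for demonstration only:
seg = lambda a, b: SimpleNamespace(start_index=a, end_index=b)
tok = lambda *segs: SimpleNamespace(
    layout=SimpleNamespace(text_anchor=SimpleNamespace(text_segments=list(segs))))
doc = SimpleNamespace(
    text="Hello World.",
    pages=[SimpleNamespace(tokens=[tok(seg(0, 6)), tok(seg(6, 11)), tok(seg(11, 12))])])

print(tokens_as_strings(doc))  # ['Hello', 'World', '.']
```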

Upvotes: 0
