Reputation: 136
I'm using Google Document AI OCR to extract the text from an image, following this guide.
I'm using this image: Test image
This is what I'm doing:
from google.cloud import documentai_v1 as documentai
from google.api_core.client_options import ClientOptions


def process_document(project_id: str, location: str,
                     processor_id: str, file_path: str,
                     mime_type: str) -> documentai.Document:
    # Client pointed at the processor's regional endpoint.
    documentai_client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    resource_name = documentai_client.processor_path(
        project_id, location, processor_id)

    # Read the image and send it inline as a RawDocument.
    with open(file_path, "rb") as image:
        image_content = image.read()
    raw_document = documentai.RawDocument(
        content=image_content, mime_type=mime_type)

    request = documentai.ProcessRequest(
        name=resource_name, raw_document=raw_document)
    result = documentai_client.process_document(request=request)
    return result.document


def main():
    project_id = 'abc'
    location = 'eu'
    processor_id = 'abc'
    file_path = 'orig.png'
    mime_type = 'image/png'

    document = process_document(project_id=project_id, location=location,
                                processor_id=processor_id, file_path=file_path,
                                mime_type=mime_type)

    print("Tokens:", len(document.pages[0].tokens))
    print("Words:", len(document.text.split()))
    print("Words:", document.text.split())


if __name__ == '__main__':
    main()
But the result is not what I'm expecting:
Tokens: 10
Words: 7
Words: ['Hello', 'World.', 'Using', "Tesseract's", 'OCR.', 'From', 'srcmake.']
So, basically, I have more tokens than words. Looking at the tokens in the following image, I can see that commas and full stops are counted as tokens as well. But is there a general way of treating document.text so that the number of words matches the number of tokens?
Upvotes: 0
Views: 119
Reputation: 2234
Tokens have a specific meaning in Document Understanding/Optical Character Recognition, and they may not always line up directly with what humans perceive as words.
If you want to separate the text by words, what you are doing here would work, but the output structure of Token can't be changed in the API.
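
If it helps to see exactly which pieces of document.text each token covers, the token text can be sliced out via its text anchor. A minimal sketch (assuming document is the documentai.Document returned by process_document above; the start_index/end_index segments are how Document AI anchors layout elements back to the full text):

def token_texts(document: documentai.Document) -> list:
    # Each token's layout.text_anchor holds one or more segments that
    # index into document.text; slicing them out gives the token string.
    texts = []
    for page in document.pages:
        for token in page.tokens:
            token_text = "".join(
                document.text[int(segment.start_index):int(segment.end_index)]
                for segment in token.layout.text_anchor.text_segments
            )
            texts.append(token_text.strip())
    return texts

Printing that list for your sample image should make the mismatch visible: punctuation such as '.' comes back as its own token, which is why the token count is higher than the word count.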
print("Words:", len(document.text.split()))
print("Words:", document.text.split())
Upvotes: 0