Reputation: 1470
I am trying to reproduce the output of the "Document Text Detection" sample UI uploader through the Google Vision API. However, the sample code only gives me individual characters as output, when I need the characters grouped together into words.
Is there a feature in the library that groups output by words instead of characters, either from the DOCUMENT_TEXT_DETECTION endpoint or from the image.detect_full_text() function in Python?
I am not looking for full text extraction, as my .jpg files are not visually structured in a way that the image.detect_text() function handles well.
Google's Sample Code:
import io

from google.cloud import vision


def detect_document(path):
    """Detects document features in an image."""
    vision_client = vision.Client()

    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    image = vision_client.image(content=content)
    document = image.detect_full_text()

    for page in document.pages:
        for block in page.blocks:
            block_words = []
            for paragraph in block.paragraphs:
                block_words.extend(paragraph.words)

            block_symbols = []
            for word in block_words:
                block_symbols.extend(word.symbols)

            block_text = ''
            for symbol in block_symbols:
                block_text = block_text + symbol.text

            print('Block Content: {}'.format(block_text))
            print('Block Bounds:\n {}'.format(block.bounding_box))
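The hierarchy returned by detect_full_text() already carries word boundaries: each word under a paragraph has its own symbols list, so joining the symbol texts per word (rather than per block, as the sample above does) yields whole words. A minimal sketch of that grouping, written as a pure helper so it works on anything with the pages → blocks → paragraphs → words → symbols shape (the helper name words_from_document is my own, not part of the library):

```python
def words_from_document(document):
    """Walk pages -> blocks -> paragraphs -> words, joining each
    word's symbols into one string and keeping its bounding box."""
    words = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = ''.join(symbol.text for symbol in word.symbols)
                    words.append((text, word.bounding_box))
    return words
```

For the sample output below, the two symbols "P" and "M" would come back joined as the single word "PM" instead of being folded into one long block string.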
Sample output of the off-the-shelf code provided by Google:
property {
  detected_languages {
    language_code: "mt"
  }
}
bounding_box {
  vertices {
    x: 1193
    y: 1664
  }
  vertices {
    x: 1206
    y: 1664
  }
  vertices {
    x: 1206
    y: 1673
  }
  vertices {
    x: 1193
    y: 1673
  }
}
symbols {
  property {
    detected_languages {
      language_code: "en"
    }
  }
  bounding_box {
    vertices {
      x: 1193
      y: 1664
    }
    vertices {
      x: 1198
      y: 1664
    }
    vertices {
      x: 1198
      y: 1673
    }
    vertices {
      x: 1193
      y: 1673
    }
  }
  text: "P"
}
symbols {
  property {
    detected_languages {
      language_code: "en"
    }
    detected_break {
      type: LINE_BREAK
    }
  }
  bounding_box {
    vertices {
      x: 1200
      y: 1664
    }
    vertices {
      x: 1206
      y: 1664
    }
    vertices {
      x: 1206
      y: 1673
    }
    vertices {
      x: 1200
      y: 1673
    }
  }
  text: "M"
}
block_words
Out[47]:
[property {
  detected_languages {
    language_code: "en"
  }
}
bounding_box {
  vertices {
    x: 1166
    y: 1664
  }
  vertices {
    x: 1168
    y: 1664
  }
  vertices {
    x: 1168
    y: 1673
  }
  vertices {
    x: 1166
    y: 1673
  }
}
symbols {
  property {
    detected_languages {
      language_code: "en"
    }
  }
  bounding_box {
    vertices {
      x: 1166
      y: 1664
    }
    vertices {
      x: 1168
      y: 1664
    }
    vertices {
      x: 1168
      y: 1673
    }
    vertices {
      x: 1166
      y: 1673
    }
  }
  text: "2"
}
Upvotes: 3
Views: 2611
Reputation: 175
There are two detection types in GCV: 1. Text Detection and 2. Document Text Detection.
Text detection is used for detecting some text in an image; it basically returns the text values found there. You cannot rely on its accuracy for dense material, for example reading receipts or other document data.
Document text detection, by contrast, is very accurate and captures every minute detail of the document. In this mode the characters are separated from each other, e.g. 03/12/2017 comes back as 0 3 / 1 2 / etc., each with its own coordinates. This is done for better accuracy.
As for your question, you are better off using the first method, text detection, which will give you results with full words and their coordinates.
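To illustrate: with TEXT_DETECTION, the response's text_annotations list holds the full text of the image at index 0, followed by one entry per detected word, each with a bounding polygon. A sketch of the parsing step as a pure helper (the name annotations_to_words is my own; the usage comment assumes the current google-cloud-vision ImageAnnotatorClient):

```python
def annotations_to_words(text_annotations):
    """Given text_annotations from a TEXT_DETECTION response, return
    (word, vertices) pairs.  Entry 0 is the full text of the image,
    so individual words start at index 1."""
    return [
        (ann.description, [(v.x, v.y) for v in ann.bounding_poly.vertices])
        for ann in text_annotations[1:]
    ]

# Usage (assuming the modern google-cloud-vision client):
#   client = vision.ImageAnnotatorClient()
#   with open(path, 'rb') as f:
#       response = client.text_detection(image=vision.Image(content=f.read()))
#   words = annotations_to_words(response.text_annotations)
```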
Upvotes: 0
Reputation: 811
This response is coming late, but I guess you were looking for something like the code below.

from google.cloud import vision


def parse_image(image_path=None):
    """
    Parse the image using the Google Cloud Vision API; detects "text"
    features in an image.
    :param image_path: path of the image
    :return: text content
    :rtype: str
    """
    client = vision.ImageAnnotatorClient()
    with open(image_path, 'rb') as image_file:
        image = vision.Image(content=image_file.read())
    response = client.text_detection(image=image)
    text = response.text_annotations
    return text[0].description

The function returns the complete text found in the image.
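If word grouping is needed in addition to the full text, document_text_detection also exposes each symbol's detected break (SPACE, LINE_BREAK, and so on, as seen in the sample output above), which marks where whitespace belongs. A sketch that rebuilds the text from the hierarchy using those break names; the break_type attribute used here is a hypothetical stand-in, since with the real response the value is nested under symbol.property.detected_break:

```python
def rebuild_text(document):
    """Reassemble text from the pages -> blocks -> paragraphs ->
    words -> symbols hierarchy, inserting the whitespace indicated
    by each symbol's detected break.  Break types are taken as
    string names on a break_type attribute (hypothetical; the real
    API nests an enum under symbol.property.detected_break)."""
    pieces = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        pieces.append(symbol.text)
                        brk = getattr(symbol, 'break_type', None)
                        if brk in ('SPACE', 'SURE_SPACE'):
                            pieces.append(' ')
                        elif brk in ('EOL_SURE_SPACE', 'LINE_BREAK'):
                            pieces.append('\n')
    return ''.join(pieces)
```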
Upvotes: 1