Reputation: 91
I analyzed a handwritten letter with Document AI and got a response. I want to post-process the output so that each block is labeled cluster1, cluster2, cluster3, and so on, each line is labeled line1, line2, line3, and so on, and each line also carries the indentation characteristic of its block, as follows:
[cluster1][line1][left indented] This line is for address
[cluster1][line2][left indented] Dear sir!
[cluster2][line1][centered] APPLICATION FOR THE POST OF A SALES REP
[cluster][line1][paragraph1] FIRST LINE OF BODY OF THE LETTER
[cluster][line2][normal] SECOND LINE OF BODY OF THE LETTER
.
.
[cluster][line1][paragraph2] HERE IS ANOTHER BODY OF THE LETTER
.
[cluster][line4][right indented] Yours sincerely
Here is the code I wrote to address this, but it crashes because I cannot access the lines of each block using block['lines'].
import json
from google.cloud import documentai

response_dict = documentai.Document.to_json(response)
data = json.loads(response_dict)
# load the json output from the Document AI API
with open('document_ai_output.json', 'r') as f:
    data = json.load(f)
# extract the pages from the json output
pages = data['pages']
# iterate through each page
for page in pages:
    # extract the blocks from the page
    blocks = page['blocks']
    # iterate through each block
    for i, block in enumerate(blocks):
        # label the block as "cluster i+1"
        block_label = "cluster {}".format(i+1)
        print("Block label:", block_label)
        # extract the bounding box coordinates for the block
        bounding_box = block['layout']['boundingPoly']['vertices']
        x_coords = [vertex['x'] for vertex in bounding_box]
        y_coords = [vertex['y'] for vertex in bounding_box]
        # calculate the average x-coordinate of the bounding box
        avg_x = sum(x_coords) / len(x_coords)
        # determine the justification of the block
        if avg_x < 200:
            justification = "left-justified"
        elif avg_x > 600:
            justification = "right-justified"
        else:
            justification = "centered"
        print("Justification:", justification)
        # extract the lines from the block
        lines = block['layout']['lines']
        # iterate through each line
        for j, line in enumerate(lines):
            # label the line as "line j+1"
            line_label = "line {}".format(j+1)
            print("Line label:", line_label)
            # extract the start and end indices for the line text
            start_index = line['layout']['startIndex']
            end_index = line['layout']['endIndex']
            # extract the line text
            line_text = line['text'][start_index:end_index]
            # print("Line text:", line_text)
            print(f'{block_label}{justification}{line_label}:{line_text}')
How can I achieve this?
Upvotes: 1
Views: 512
Reputation: 2232
This code sample for the Document OCR processor shows how to extract all of the Document text structure fields from a processed document.
Note: all processors should include this information in the Document object output, so this sample should work for all processors.
https://cloud.google.com/document-ai/docs/handle-response#code_samples
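As a rough illustration of what that sample does with text, here is a minimal sketch that works on the JSON form of the Document (a plain dict). It assumes the camelCase keys textAnchor, textSegments, startIndex and endIndex, that index values may arrive as strings, and that startIndex may be missing when it is 0. The helper name layout_text is made up for this example.

```python
def layout_text(layout: dict, full_text: str) -> str:
    """Return the text that a layout element points to.

    A layout does not hold its own text. It holds offsets into the
    document-level "text" field, so the text is recovered by slicing.
    """
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    return "".join(
        # Assumption: indices can be strings in the JSON form, and
        # startIndex can be omitted when it is 0, so default and cast.
        full_text[int(seg.get("startIndex", 0)):int(seg.get("endIndex", 0))]
        for seg in segments
    )
```

With the dict from the question this would be called as layout_text(line['layout'], data['text']) for any line, block or paragraph.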
For your specific issue, it looks like you're trying to access the lines element as a child of blocks, but it should be a child of the Page element.
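Putting that together, one possible approach is to read the lines from each page and assign every line to the block whose text range contains it. The sketch below is not an official method: it assumes the JSON form of the Document with camelCase keys (textAnchor, textSegments, boundingPoly, normalizedVertices), and the function names, the labels and the tol threshold are all made up for illustration. The alignment heuristic compares the free space left and right of each line, which may need tuning for handwritten pages.

```python
def anchor_span(layout: dict) -> tuple:
    """Return (start, end) offsets of a layout into the document text."""
    segs = layout.get("textAnchor", {}).get("textSegments", [])
    if not segs:
        return (0, 0)
    # Assumption: indices may be strings, and startIndex may be omitted when 0.
    starts = [int(s.get("startIndex", 0)) for s in segs]
    ends = [int(s.get("endIndex", 0)) for s in segs]
    return (min(starts), max(ends))


def x_extent(layout: dict) -> tuple:
    """Return (left, right) of the bounding box as fractions of page width."""
    verts = layout.get("boundingPoly", {}).get("normalizedVertices", [])
    xs = [float(v.get("x", 0.0)) for v in verts]
    return (min(xs), max(xs)) if xs else (0.0, 1.0)


def classify_alignment(left: float, right: float, tol: float = 0.05) -> str:
    """Hypothetical heuristic based on the gaps beside a line."""
    left_gap, right_gap = left, 1.0 - right
    if abs(left_gap - right_gap) <= tol:
        # Equal gaps: wide gaps suggest a centered line, narrow gaps body text.
        return "centered" if left_gap > 2 * tol else "normal"
    return "left indented" if left_gap < right_gap else "right indented"


def label_lines(document: dict) -> list:
    """Build '[clusterN][lineM][alignment] text' strings for every line."""
    full_text = document.get("text", "")
    results = []
    for page in document.get("pages", []):
        lines = page.get("lines", [])  # lines belong to the page, not the block
        for b, block in enumerate(page.get("blocks", []), start=1):
            b_start, b_end = anchor_span(block.get("layout", {}))
            n = 0
            for line in lines:
                layout = line.get("layout", {})
                l_start, l_end = anchor_span(layout)
                # A line is part of a block when its text range lies inside it.
                if l_end > l_start and l_start >= b_start and l_end <= b_end:
                    n += 1
                    left, right = x_extent(layout)
                    text = full_text[l_start:l_end].strip()
                    label = classify_alignment(left, right)
                    results.append(f"[cluster{b}][line{n}][{label}] {text}")
    return results
```

Using it on the dict loaded in the question would look like: for row in label_lines(data): print(row). Normalized vertices are used instead of pixel coordinates so the thresholds do not depend on the scan resolution.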
Upvotes: 1