adeshina Ibrahim

Reputation: 91

How can I return each line of each block in the Document AI OCR processor response?

I have analyzed my handwritten letter with Document AI and received the response. I want to process the output further so that each block is labeled cluster1, cluster2, cluster3, ..., each line within a block is labeled line1, line2, line3, ..., and each line also carries the indentation characteristics of its block, as follows:

[cluster1][line1][left indented] This line is for address
[cluster1][line2][left indented] Dear sir!

[cluster2][line1][centered] APPLICATION FOR THE POST OF A SALES REP


[cluster][line1][paragraph1] FIRST LINE OF BODY OF THE LETTER


[cluster][line2][normal]    SECOND LINE OF BODY OF THE LETTER
.
.
[cluster][line1][paragraph2] HERE IS ANOTHER BODY OF THE LETTER


.
[cluster][line4][right indented]  Yours sincerely

Here is the code I have written to address this. It crashes because I cannot access the lines of each block through block['layout']['lines'].

response_dict = documentai.Document.to_json(response)
data = json.loads(response_dict)
# load the json output from the Document AI API

with open('document_ai_output.json', 'r') as f:
    data = json.load(f)

# extract the pages from the json output
pages = data['pages']

# iterate through each page
for page in pages:
    # extract the blocks from the page
    blocks = page['blocks']
    # iterate through each block
    for i, block in enumerate(blocks):
        # label the block as "cluster i+1"
        block_label = "cluster {}".format(i+1)
        print("Block label:", block_label)
        # extract the bounding box coordinates for the block
        bounding_box = block['layout']['boundingPoly']['vertices']
        x_coords = [vertex['x'] for vertex in bounding_box]
        y_coords = [vertex['y'] for vertex in bounding_box]
        # calculate the average x-coordinate of the bounding box
        avg_x = sum(x_coords) / len(x_coords)
        # determine the justification of the block
        if avg_x < 200:
            justification = "left-justified"
        elif avg_x > 600:
            justification = "right-justified"
        else:
            justification = "centered"
        print("Justification:", justification)
        # extract the lines from the block
        lines = block['layout']['lines']
        # iterate through each line
        for j, line in enumerate(lines):
            # label the line as "line j+1"
            line_label = "line {}".format(j+1)
            print("Line label:", line_label)
            # extract the start and end indices for the line text
            start_index = line['layout']['startIndex']
            end_index = line['layout']['endIndex']
            # extract the line text
            line_text = line['text'][start_index:end_index]
            # print("Line text:", line_text)
            print(f'{block_label}{justification}{line_label}:{line_text}')

How can I achieve this?

Upvotes: 1

Views: 512

Answers (1)

Holt Skinner

Reputation: 2232

The code sample for the Document OCR processor linked below shows how to extract all of the Document text structure fields from a processed document.

Note: all processors should include this information in the Document object output, so the sample should work for any processor.

https://cloud.google.com/document-ai/docs/handle-response#code_samples

For your specific issue, it looks like you're trying to access the lines element as a child of each block, but lines is a child of the Page element (page['lines']), alongside blocks.
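Because lines live on the page rather than inside a block, one way to recover the block/line grouping is to compare text offsets: a line belongs to a block when its text anchor range falls inside the block's range. The following is only a minimal sketch under a few assumptions: it works on the dictionary obtained from the JSON export (camelCase keys such as textAnchor and textSegments), it treats offsets as possibly being strings or missing when zero, and the helper names segment_range and label_lines are made up for illustration. It does not compute indentation.

```python
def segment_range(layout):
    """Return (start, end) character offsets covered by a layout's text anchor."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    if not segments:
        return (0, 0)
    # Assumption: the JSON export may encode offsets as strings and may
    # omit startIndex when it is 0, so convert and default defensively.
    start = min(int(seg.get("startIndex", 0)) for seg in segments)
    end = max(int(seg.get("endIndex", 0)) for seg in segments)
    return (start, end)


def label_lines(document):
    """Return (block_label, line_label, text) tuples, with lines grouped by block."""
    text = document.get("text", "")
    results = []
    for page in document.get("pages", []):
        # Lines are siblings of blocks on the page, not children of a block.
        line_ranges = [segment_range(line["layout"]) for line in page.get("lines", [])]
        for i, block in enumerate(page.get("blocks", []), start=1):
            block_start, block_end = segment_range(block["layout"])
            # A line is assigned to the block whose offset range contains it.
            members = [
                (start, end)
                for start, end in line_ranges
                if start < end and block_start <= start and end <= block_end
            ]
            for j, (start, end) in enumerate(members, start=1):
                results.append((f"cluster{i}", f"line{j}", text[start:end].strip()))
    return results
```

With the asker's data this could be called as label_lines(data) after json.load, printing one row per tuple. Line numbering restarts in every block, which matches the [cluster1][line1], [cluster1][line2], [cluster2][line1] pattern in the question.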

Upvotes: 1
