Convert OCR model results to CSV

Question

I used docTR to extract the fields from a dummy W-2 form, using the following code:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
model = ocr_predictor(pretrained=True, resolve_blocks=True, assume_straight_pages=True)
# Load the image (use DocumentFile.from_pdf for a PDF)
doc = DocumentFile.from_images("w2.jpg")
# Run OCR on the document
result = model(doc)

The result is a Document object with a lot of detail. To get the data out of this object, I wrote this code:

for page in result.pages:
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                print(word)

But the output has no relation between the labels and their values:

Word(value='a', confidence=0.99)
Word(value='Centrol', confidence=0.53)
Word(value='number', confidence=1.0)
Word(value='d', confidence=1.0)
Word(value='Emplovee's', confidence=1.0)
Word(value='social', confidence=0.98)
Word(value='securitv', confidence=0.97)
Word(value='number', confidence=1.0)
Word(value='999-99-9999', confidence=0.99)

As the output shows, the labels and values are listed without being paired:

a. Control number
d. Employee's social security number
1. Wages, tips other compensation
999-99-9999
41,770.30
...

Now we don't know which value belongs to which field. I want to construct the output like this:

Control number: 
Employee's social security number: 999-99-9999
1. Wages, tips other compensation: 41,770.30

How can I parse the result object to get the desired output? I see a similar problem with the easyOCR package too. More generally, how can the output of OCR models be parsed to build a CSV? I have attached the sample w2.pdf I used for testing.
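
One possible direction, as a minimal sketch rather than a definitive implementation: use the position of each line on the page to attach a value to the nearest label above or beside it. The helper names (is_value, pair_labels_with_values, to_csv) are hypothetical, the coordinates in the sample are invented for illustration, and the sketch assumes each docTR line exposes a geometry of the form ((xmin, ymin), (xmax, ymax)) in relative page coordinates, which should hold with assume_straight_pages=True. The "value sits just below or right of its label" rule is a heuristic that may need tuning for a real W-2.

```python
import csv
import io
import re

# Heuristic (an assumption): a line is a "value" if it contains a digit and
# only digits, whitespace, and common numeric separators.
VALUE_RE = re.compile(r"^[\d\s.,$-]+$")


def is_value(text):
    return bool(VALUE_RE.match(text)) and any(ch.isdigit() for ch in text)


def pair_labels_with_values(lines):
    """Pair each value line with the nearest label that starts at or above it.

    lines: list of (text, ((xmin, ymin), (xmax, ymax))) tuples.
    Returns a dict mapping label text to value text ("" if no value was found).
    """
    labels = [(t, g) for t, g in lines if not is_value(t)]
    values = [(t, g) for t, g in lines if is_value(t)]
    pairs = {t: "" for t, _ in labels}
    for vtext, ((vx0, vy0), _) in values:
        # Only consider labels whose top edge is at or above the value's top edge.
        candidates = [(t, g) for t, g in labels if g[0][1] <= vy0]
        if not candidates:
            continue
        # Pick the label whose top-left corner is closest to the value's.
        best, _ = min(
            candidates,
            key=lambda lg: (lg[1][0][0] - vx0) ** 2 + (lg[1][0][1] - vy0) ** 2,
        )
        # If several values land on one label, keep them all, space-separated.
        pairs[best] = (pairs[best] + " " + vtext).strip()
    return pairs


def to_csv(pairs):
    """Serialize the label/value pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["field", "value"])
    for label, value in pairs.items():
        writer.writerow([label, value])
    return buf.getvalue()


# Illustrative input with made-up coordinates. With docTR, the list could be
# built from the result object along these lines (untested assumption):
# lines = [(" ".join(w.value for w in line.words), line.geometry)
#          for page in result.pages
#          for block in page.blocks
#          for line in block.lines]
sample_lines = [
    ("a. Control number", ((0.05, 0.05), (0.30, 0.08))),
    ("d. Employee's social security number", ((0.40, 0.05), (0.80, 0.08))),
    ("999-99-9999", ((0.42, 0.09), (0.60, 0.12))),
    ("1. Wages, tips other compensation", ((0.40, 0.20), (0.80, 0.23))),
    ("41,770.30", ((0.45, 0.24), (0.60, 0.27))),
]

if __name__ == "__main__":
    print(to_csv(pair_labels_with_values(sample_lines)))
```

Labels that receive no value (such as the control number here) are kept with an empty value, which matches the desired output above. OCR misreads such as 'Centrol' are not corrected by this approach and would need a separate cleanup step.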

Answers (1)
