Reputation: 131
I am using Alfresco community 6.1.
I have thousands of invoices to scan and OCR (with near 100% recognition), and I need to retrieve the relevant metadata from them (Partner, Invoice Number, Amount, Units, Currency, ...). All of this should happen inside Alfresco.
Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply some workflows, ...).
As a first approach:
For the OCR I used Alfresco Simple OCR Action, but the result is not very accurate (far from 100%).
To retrieve the results, I convert the OCRed PDF to a plain text file and then search its content using JavaScript with document.content ... But since the OCR is not accurate, I can't tell whether searching inside the document is the best solution.
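The search step described above could be sketched roughly as follows. This is only an illustration: the function name `extractInvoiceFields`, the label patterns and the currency list are hypothetical, and real invoices would need patterns tuned to their own layouts.

```javascript
// Hypothetical label patterns; adjust them to the wording on the real invoices.
var FIELD_PATTERNS = {
  invoiceNumber: /invoice\s*(?:no|number|#)\.?\s*[:\-]?\s*([A-Z0-9\-\/]+)/i,
  amount: /total\s*(?:amount)?\s*[:\-]?\s*([0-9][0-9.,]*)/i,
  currency: /\b(EUR|USD|GBP)\b/
};

// Returns one entry per field: the first captured match, or null if not found.
function extractInvoiceFields(text) {
  var result = {};
  for (var name in FIELD_PATTERNS) {
    var match = FIELD_PATTERNS[name].exec(text);
    result[name] = match ? match[1] : null;
  }
  return result;
}
```

In the setup described here, the input would be the plain text read from document.content. A null field is a cheap signal that either the OCR or the pattern failed for that invoice, so such documents can be routed to manual review instead of being filed automatically.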
So my questions are:
How can I make the OCR results more accurate?
How can I retrieve the important data from the invoice? Is the method I'm using good enough, or is it too weak for this kind of processing?
I'm using pdfsandwich, and my alfresco-global.properties is:
ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux
Upvotes: 1
Views: 1317
Reputation: 1339
To answer your questions.
To improve OCR results, you need to pre-process the image. That includes noise removal, line removal, thresholding, etc. But none of that helps if the OCR engine itself is not accurate. Tesseract, from version 4.0.0 onward, works well enough for most applications.
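To make the thresholding step concrete, here is a minimal sketch using Otsu's method, which is one common way to pick a global threshold (the choice of Otsu is an assumption made for this example, not something the setup above requires). It works on a flat array of grayscale values in the range 0-255; the function names are hypothetical, and decoding the scanned page into such an array is left out.

```javascript
// Pick a global threshold that maximizes the between-class variance (Otsu's method).
// `pixels` is assumed to be an array of integer grayscale values in 0..255.
function otsuThreshold(pixels) {
  var histogram = [];
  var i;
  for (i = 0; i < 256; i++) { histogram[i] = 0; }
  for (i = 0; i < pixels.length; i++) { histogram[pixels[i]]++; }

  var total = pixels.length;
  var sumAll = 0;
  for (i = 0; i < 256; i++) { sumAll += i * histogram[i]; }

  var sumBackground = 0;
  var weightBackground = 0;
  var bestVariance = -1;
  var bestThreshold = 0;

  for (var t = 0; t < 256; t++) {
    weightBackground += histogram[t];
    if (weightBackground === 0) { continue; }   // no pixels at or below t yet
    var weightForeground = total - weightBackground;
    if (weightForeground === 0) { break; }      // all pixels are at or below t

    sumBackground += t * histogram[t];
    var meanBackground = sumBackground / weightBackground;
    var meanForeground = (sumAll - sumBackground) / weightForeground;
    var diff = meanBackground - meanForeground;
    var betweenVariance = weightBackground * weightForeground * diff * diff;

    if (betweenVariance > bestVariance) {
      bestVariance = betweenVariance;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}

// Map every pixel above the threshold to white (255) and the rest to black (0).
function binarize(pixels, threshold) {
  return pixels.map(function (p) { return p > threshold ? 255 : 0; });
}
```

In practice this kind of cleanup is usually done with an image-processing tool before the page reaches the OCR engine; the sketch only shows what the step computes.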
Your approach may work in some cases, but it will not work well on a large set of invoices. I suggest using one of the invoice data extraction services. In that case, you don't need to worry about the preprocessing or the extraction itself. You could use:
Using such a service can save you a lot of headaches and time.
Disclaimer: I am one of the creators of typless. Feel free to suggest edits.
Upvotes: 0
Reputation: 2707
I'm afraid this question is off topic: https://stackoverflow.com/help/on-topic
Some input anyway:
Upvotes: 2