zenami

Reputation: 131

OCR a scanned file and retrieve the metadata

I am using Alfresco community 6.1.

I have thousands of invoices to scan and OCR (with near 100% recognition), and I need to retrieve the required metadata from them (Partner, Invoice Number, Amount, Units, Currency, ...). All of this should happen in Alfresco.

Based on the retrieved metadata, I need to perform some operations on the invoices (move them to the appropriate folders, apply some workflows, ...).

As a first approach:

So my questions are:

I'm using pdfsandwich, and my alfresco-global.properties is:

ocr.command=/usr/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o

ocr.extra.commands=-verbose -lang eng
ocr.server.os=linux

Upvotes: 1

Views: 1317

Answers (2)

Jan Giacomelli

Reputation: 1339

To answer your questions.

To improve OCR results you need to pre-process the image. That includes noise removal, line removal, thresholding, etc. But none of that helps if the OCR engine itself is not accurate. Tesseract, from version 4.0.0 onwards, works well enough for most applications.
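As a rough illustration of the thresholding step, here is a minimal sketch of Otsu's method in plain Python. It assumes the page has already been converted to a flat list of 8-bit grayscale values; the function names `otsu_threshold` and `binarize` are made up for this example, and a real pipeline would more likely use an imaging library.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]          # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg     # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Toy data: three dark "ink" pixels and three light "paper" pixels.
sample = [10, 12, 11, 200, 205, 198]
level = otsu_threshold(sample)
clean = binarize(sample, level)
```

Binarizing this way removes most of the gray background noise before the page is handed to the OCR engine, but it is only one of the pre-processing steps mentioned above.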

Your approach may work in some cases, but it will not work well on a large set of invoices. I suggest using one of the invoice data extraction services. In that case, you don't need to worry about the preprocessing or the extraction itself. You could use:

Using such a service can save you a lot of headaches and time.

Disclaimer: I am one of the creators of typless. Feel free to suggest edits.

Upvotes: 0

Heiko Robert

Reputation: 2707

I'm afraid this question is off topic: https://stackoverflow.com/help/on-topic

Some input anyway:

  • I highly recommend doing all the OCR/classification/extraction outside of Alfresco, before storing the PDFs there.
  • The technical term for what you're looking for is Document Capture. If you really expect to classify your scanned documents and extract data from inbound documents (whose structure you can't control), the solutions are quite expensive and licensed per page/period. The market leaders in that area are Kofax and ABBYY.
  • If you can control the document structure, or if the structure is fixed, you could use considerably cheaper solutions that rely on something like a dynamic template approach (depending on found anchor points, barcodes, regex matches). We use PDFmdx for this to automate qualified extraction.
  • Everything depends on the OCR quality. My personal opinion: the free/open-source OCR components can't compete with the commercial solutions if you don't have the time, expertise and resources to train and optimize them. ABBYY has a quite affordable CLI solution for Linux (ABBYY FineReader Engine CLI for Linux), but I'm sure there are others with similar results.
  • There is a quite nice and simple solution called AutoOCR, a REST/SOAP service that provides a generic, configurable interface for using several OCR engines and configurations as a service. We implemented an Alfresco integration that acts as an Alfresco Transformer, but since the Alfresco Transformer framework is deprecated, I'd recommend doing the whole OCR and recognition work before storing the documents in Alfresco.
  • Finally, if this is a one-time effort: try to find a service provider that does at least the OCR, and maybe also the classification/extraction.
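To give an idea of what the regex-match part of a template approach can look like, here is a minimal Python sketch. It is not how PDFmdx or any other product works internally. The labels ("Invoice No", "Total Amount"), the currency list and the `extract_fields` helper are assumptions made up for one hypothetical invoice layout, and the input is assumed to be plain text already produced by the OCR step.

```python
import re

# Hypothetical patterns for a single, fixed invoice layout.
# Each label acts as an anchor; the capture group is the value next to it.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9/-]*\d[A-Z0-9/-]*)",
        re.IGNORECASE,
    ),
    "amount": re.compile(
        r"Total\s*(?:Amount)?\s*:?\s*(\d+(?:[.,]\d+)*)",
        re.IGNORECASE,
    ),
    "currency": re.compile(r"\b(EUR|USD|GBP|CHF)\b"),
}


def extract_fields(ocr_text):
    """Return a dict of field name -> matched string, or None if not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


sample_text = "ACME GmbH\nInvoice No: INV-2019-0042\nTotal Amount: 1,250.00 EUR\n"
metadata = extract_fields(sample_text)
```

The amount is deliberately kept as a string, because decimal and thousands separators depend on the locale of the invoice. Fields that come back as `None` are the ones to send to manual review, and this only holds up as long as the layout really is fixed and the OCR output is clean.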

Upvotes: 2
