Reputation: 6639

Performing Optical Character Recognition on PDFs from ColdFusion using a Java or .NET Library?

I am looking to take a PDF and extract any text from it. I then want to index that text with ColdFusion's built-in Verity search so the contents are searchable.

Are there any libraries out there that already do this well? I am including Java or .NET libraries (Java preferred) in the scope, since they can be called from CF.

Any insights or experiences would be greatly appreciated... thanks!

Edit: As far as I know, indexing PDF files with CF works when the text is embedded in the PDF. The PDFs I have to deal with contain the text only as scanned images.

Upvotes: 1

Answers (4)

Jas Panesar

Reputation: 6639

On a semi-related note, I found a very neat post about encoding and reading 2D matrix barcodes in ColdFusion.

http://www.stillnetstudios.com/2007/12/15/2d-barcodes-coldfusion/

This might solve some of my issues in needing to extract encoded information, but I am still after the body of the text.

Regarding Tesseract, I found a .NET version too (tessnet2): http://www.pixel-technology.com/freeware/tessnet2/ If only I could feed in PDFs natively instead of TIFFs... :)

Upvotes: 0

Peter Boughton

Reputation: 112200

If you have the ability to run your own software (i.e. a dedicated server or VPS), then you could investigate using Tesseract OCR with cfexecute to convert the PDFs to text.
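
As a rough illustration of that idea, here is a minimal Java sketch that shells out to the command-line tools. It makes several assumptions that are not from this thread: that Ghostscript (`gs`) is installed and used to rasterize the scanned PDF to a TIFF first (Tesseract works on images, not PDFs), that both binaries are on the PATH, and that the installed Tesseract build can read a multi-page TIFF. The class name `PdfOcrSketch` and the file names `scan.tif` / `scan.txt` are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: rasterize a scanned PDF, then OCR the result with Tesseract.
class PdfOcrSketch {

    // Assumption: Ghostscript is installed as "gs".
    // Renders the PDF pages to a 300 dpi black-and-white TIFF.
    static List<String> rasterizeCommand(Path pdf, Path tiff) {
        return List.of(
                "gs", "-dNOPAUSE", "-dBATCH",
                "-sDEVICE=tiffg4", "-r300",
                "-sOutputFile=" + tiff,
                pdf.toString());
    }

    // Tesseract CLI form: tesseract <image> <outputbase> -l <lang>
    // The recognized text ends up in <outputbase>.txt.
    static List<String> ocrCommand(Path image, Path outputBase, String language) {
        return List.of(
                "tesseract", image.toString(), outputBase.toString(),
                "-l", language);
    }

    // Runs an external command, forwarding its output, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfOcrSketch <scanned.pdf>");
            return;
        }
        Path pdf = Path.of(args[0]);
        Path tiff = Path.of("scan.tif");   // illustrative file name
        Path textBase = Path.of("scan");   // Tesseract appends ".txt"

        if (run(rasterizeCommand(pdf, tiff)) != 0) {
            throw new IOException("Ghostscript failed");
        }
        if (run(ocrCommand(tiff, textBase, "eng")) != 0) {
            throw new IOException("Tesseract failed");
        }
        System.out.println("OCR text written to " + textBase + ".txt");
    }
}
```

The same two command lines could presumably be passed to cfexecute through its name and arguments attributes instead of going through Java. Either way, the resulting .txt file is plain text that can then be added to the search collection.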

Upvotes: 1

Peter Boughton

Reputation: 112200

Ray Camden has an eight-part series on working with PDFs in ColdFusion 8.

Part 7 of the series covers using DDX to get text out of a PDF.

I'm not sure this will work for your OCR needs, but it may still be worth looking at.

Upvotes: 0