Reputation: 6639
I am looking to take a PDF and extract any text from it. I then want to index that text with ColdFusion's built-in Verity search so that the contents are searchable.
Are there any libraries out there that already do this well? I am including Java and .NET libraries in the scope (Java preferred), since they can be called from CF.
Any insights or experiences would be greatly appreciated... thanks!
Edit: As far as I know, CF can index PDF files when the text is embedded in the PDF. The PDFs I have to deal with contain the text only as a scanned image.
Upvotes: 1
Views: 3078
Reputation: 6639
On a semi-related note, I found a very neat post about encoding and reading 2D matrix barcodes in ColdFusion.
http://www.stillnetstudios.com/2007/12/15/2d-barcodes-coldfusion/
This might solve some of my issues with extracting encoded information, but I am still after the body of the text.
Regarding tessnet, I found a .NET version too: http://www.pixel-technology.com/freeware/tessnet2/ Now if only I could feed it PDFs natively instead of TIFFs... :)
Upvotes: 0
Reputation: 112200
If you are able to run your own software (i.e. on a dedicated server or VPS), you could investigate using Tesseract OCR with cfexecute to convert the PDFs to text.
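Since Java was mentioned as the preferred option, here is a minimal sketch of the same idea done from Java instead of cfexecute. It rests on assumptions that are not from this thread: that both `tesseract` and Poppler's `pdftoppm` are installed and on the PATH, that `pdftoppm` is used to turn the first PDF page into a TIFF before OCR, and that a Tesseract build that reads that TIFF is available. The class and method names (`PdfOcrSketch`, `ocrFirstPage`, and so on) are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: shells out to external tools, much like cfexecute would.
final class PdfOcrSketch {

    // Assumed tool: Poppler's pdftoppm. Renders only the first page at 300 dpi
    // and writes it to <imageBase>.tif (-singlefile suppresses page numbering).
    static List<String> rasterizeCommand(Path pdf, Path imageBase) {
        return List.of("pdftoppm", "-r", "300", "-tiff", "-singlefile",
                pdf.toString(), imageBase.toString());
    }

    // "tesseract <image> <outputBase>" writes the recognized text to <outputBase>.txt.
    static List<String> ocrCommand(Path image, Path textBase) {
        return List.of("tesseract", image.toString(), textBase.toString());
    }

    // Runs one external command and fails loudly on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command.get(0) + " exited with code " + exitCode);
        }
    }

    // Rasterizes page 1 of the PDF, runs OCR on it, and returns the path
    // of the text file that tesseract is expected to have written.
    static Path ocrFirstPage(Path pdf, Path workBase) throws IOException, InterruptedException {
        run(rasterizeCommand(pdf, workBase));
        Path image = Path.of(workBase + ".tif");
        run(ocrCommand(image, workBase));
        return Path.of(workBase + ".txt");
    }
}
```

Calling `PdfOcrSketch.ocrFirstPage(Path.of("scan.pdf"), Path.of("scan"))` should leave a `scan.txt` next to the PDF, which could then be added to a Verity collection like any other text file. Multi-page documents would need a loop over pages, which is left out here.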
Upvotes: 1
Reputation: 112200
Ray Camden has an eight-part series on working with PDFs in ColdFusion 8.
Part 7 of the series covers using DDX to get text out of a PDF.
I'm not sure this will work for your OCR needs, but it may still be worth looking at.
Upvotes: 0