Reputation: 1282
To preface this, I know there are discussions on this in various places. Half of what I read is outdated, buggy or simply unrelated to my situation.
This is why I am bringing it to the community that I know will have the answers.
Question: I have a directory of PDF documents (ideally hosted online) adding up to around 70,000 pages; individual documents range from 20 pages to several hundred.
I am looking for a method, script or idea for the easiest way to search these PDFs for products. The PDFs all have a text layer that was created by OCR in Acrobat.
Any ideas, whether they be elaborate or inventive, are more than welcome.
Upvotes: 1
Views: 223
Reputation: 8543
My recommendation would be Apache Solr, a search server built on Lucene that is dead simple to use via its RESTful interface. It also has a subproject called Tika which extracts metadata and structured text content from multiple formats (including PDF).
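For example, here is a minimal Python sketch of that workflow. It assumes a local Solr core named "pdfs" with the ExtractingRequestHandler (Solr Cell, which wraps Tika) enabled; the core name, the "pdfs" directory, and the "text" query field are placeholders that depend on your schema:

    # Minimal sketch, not a drop-in solution: assumes a local Solr core
    # named "pdfs" with the ExtractingRequestHandler (Solr Cell) enabled.
    import os
    import requests

    SOLR = "http://localhost:8983/solr/pdfs"

    def index_pdf(path):
        """POST one PDF to Solr; Tika extracts the OCR text layer server-side."""
        with open(path, "rb") as f:
            resp = requests.post(
                f"{SOLR}/update/extract",
                params={"literal.id": os.path.basename(path), "commit": "true"},
                files={"file": f},
            )
        resp.raise_for_status()

    def search(term):
        """Full-text query; the 'text' field name depends on your fmap config."""
        resp = requests.get(f"{SOLR}/select", params={"q": f"text:{term}", "wt": "json"})
        resp.raise_for_status()
        return [doc["id"] for doc in resp.json()["response"]["docs"]]

    for name in os.listdir("pdfs"):
        if name.lower().endswith(".pdf"):
            index_pdf(os.path.join("pdfs", name))

    print(search("widget"))

Once the documents are indexed, Solr handles ranking, paging, and highlighting for you, which matters at 70,000 pages.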
Upvotes: 2
Reputation: 19309
Xpdf has a utility called pdftotext, which is often installed on Linux distributions. I would write a tool that uses it to build an index mapping each word to the documents it appears in. You could store the index in a database and then write a search against it.
It would take a little more space, but it would also be simple to store a sentence of context with each word to show in the search results (see the sketch below).
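A rough Python sketch of that idea, assuming pdftotext is on the PATH and using SQLite for the index (the "pdfs" directory and the table/column names are illustrative):

    # Rough sketch: pdftotext extracts the OCR text layer, SQLite stores
    # an inverted index of word -> (document, context sentence).
    import re
    import sqlite3
    import subprocess
    from pathlib import Path

    db = sqlite3.connect("pdf_index.db")
    db.execute("CREATE TABLE IF NOT EXISTS hits (word TEXT, doc TEXT, context TEXT)")
    db.execute("CREATE INDEX IF NOT EXISTS idx_word ON hits(word)")

    def index_pdf(path):
        # "-" tells pdftotext to write the extracted text to stdout.
        text = subprocess.run(
            ["pdftotext", str(path), "-"],
            capture_output=True, text=True, check=True,
        ).stdout
        # Crude sentence split; each word is stored with its sentence as context.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            for word in set(re.findall(r"[A-Za-z0-9'-]{2,}", sentence)):
                db.execute(
                    "INSERT INTO hits VALUES (?, ?, ?)",
                    (word.lower(), path.name, sentence.strip()),
                )
        db.commit()

    def search(term):
        return db.execute(
            "SELECT doc, context FROM hits WHERE word = ?", (term.lower(),)
        ).fetchall()

    for pdf in Path("pdfs").glob("*.pdf"):
        index_pdf(pdf)

    for doc, context in search("widget"):
        print(f"{doc}: {context}")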
Upvotes: 2
Reputation: 316939
Use a search engine like Lucene or Sphinx to index and tag the PDFs. The Zend Framework has both a component to read and write PDF files and a Lucene implementation.
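The Zend components are PHP, but the same index-then-query pattern can be sketched in Python with Whoosh, a pure-Python Lucene-style library; here pdftotext on the PATH and the "pdfs" and "indexdir" directories are assumptions, not anything the Zend components require:

    # Lucene-style index/query pattern using Whoosh as a Python stand-in.
    import os
    import subprocess
    from whoosh import index
    from whoosh.fields import ID, TEXT, Schema
    from whoosh.qparser import QueryParser

    schema = Schema(path=ID(stored=True), content=TEXT)
    os.makedirs("indexdir", exist_ok=True)
    ix = index.create_in("indexdir", schema)

    writer = ix.writer()
    for name in os.listdir("pdfs"):
        if name.lower().endswith(".pdf"):
            # Pull the OCR text layer out of the PDF; "-" writes to stdout.
            text = subprocess.run(
                ["pdftotext", os.path.join("pdfs", name), "-"],
                capture_output=True, text=True, check=True,
            ).stdout
            writer.add_document(path=name, content=text)
    writer.commit()

    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("widget")
        for hit in searcher.search(query, limit=20):
            print(hit["path"])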
Upvotes: 2