Craig Hooghiem

Reputation: 1282

Project Thoughts: Searching Directory of PDFs

To preface this, I know there are discussions on this in various places. Half of what I read is outdated, buggy or simply unrelated to my situation.

This is why I am bringing it to the community that I know will have the answers.

Question: I have a directory (ideally hosted online) of PDF documents totaling around 70,000 pages. Individual documents range from 20 pages to several hundred.

I am looking for a method, script or idea for the easiest way to search these PDFs for products. The PDFs all have a text layer that was created by OCR in Acrobat.

Any ideas, whether they be elaborate or inventive, are more than welcome.

Upvotes: 1

Views: 223

Answers (3)

Mikos

Reputation: 8543

My recommendation would be Apache Solr, a search server built on Lucene that is dead simple to use through its RESTful interface. The related Apache Tika project extracts metadata and structured text content from multiple formats (incl. PDF).
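To give a feel for the RESTful interface, here is a minimal sketch of querying Solr's select handler from Python. The host, port, and core name (`pdfs`) are assumptions for illustration, as is the idea that the PDFs have already been indexed and that the default search field holds the extracted text.

```python
import json
import urllib.parse
import urllib.request

# Assumed location of a Solr core named "pdfs" (hypothetical); adjust to your setup.
SOLR_BASE = "http://localhost:8983/solr/pdfs"


def build_query_url(term, rows=10, base=SOLR_BASE):
    """Build a URL for Solr's /select handler, asking for JSON output."""
    params = urllib.parse.urlencode({"q": term, "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"


def search(term, rows=10, base=SOLR_BASE):
    """Send the query and return the list of matching documents."""
    with urllib.request.urlopen(build_query_url(term, rows, base)) as resp:
        return json.load(resp)["response"]["docs"]
```

Calling `search("blue widget")` needs a running Solr instance; `build_query_url` can be inspected on its own to see what request would be sent.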

Upvotes: 2

Cfreak

Reputation: 19309

Xpdf includes a utility called pdftotext, which is often installed on Linux distributions. I would create a tool that uses it to build an index mapping each word to the documents it appears in. You could store the index in a database and then write a search against that.

It would take a little more space but it would be simple to include a sentence of context as well to show in the search results.
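A rough sketch of that idea in Python, assuming pdftotext is on the PATH and using SQLite as the database. The table layout and function names are made up for illustration, and the sentence splitting is deliberately naive.

```python
import re
import sqlite3
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on one file; '-' sends the extracted text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def open_index(db_path=":memory:"):
    """Open (or create) the word index. The schema is a hypothetical example."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings (word TEXT, doc TEXT, context TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON postings (word)")
    return conn


def index_text(conn, doc, text):
    """Store one row per distinct word, with the first sentence it appears in."""
    seen = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for word in re.findall(r"[a-z0-9]+", sentence.lower()):
            if word not in seen:
                seen.add(word)
                conn.execute(
                    "INSERT INTO postings VALUES (?, ?, ?)",
                    (word, doc, sentence.strip()),
                )
    conn.commit()


def search(conn, word):
    """Return (document, context sentence) pairs for an exact word match."""
    rows = conn.execute(
        "SELECT doc, context FROM postings WHERE word = ? ORDER BY doc",
        (word.lower(),),
    )
    return rows.fetchall()
```

To index a real directory you would loop over the PDF paths and call `index_text(conn, path, extract_text(path))` for each one. OCR text tends to be noisy, so exact word matching may miss misrecognized product names.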

Upvotes: 2

Gordon

Reputation: 316939

Use a search engine like Lucene or Sphinx to index and tag the PDFs. The Zend Framework has both a component to read and write PDF files and a Lucene implementation.

Upvotes: 2
