Reputation: 976
We're building a PDF search engine with Solr and Lucene where users can search for text in PDFs. The database only contains PDFs.
On the search results page ("/browse"), we want to append #page=X to the PDF link, where X is the page the text was found on. (Adobe Acrobat automatically scrolls to the given page when it is specified in the URL fragment.)
For example, if I search for foobar and there's a PDF document where foobar is on page 5, the link should be http://pdfserver/pdfs/pdf.pdf#page=5 (note the anchor at the end).
Upvotes: 4
Views: 1024
Reputation: 976
One easy-to-implement solution I found was to use the #search parameter that Adobe Reader supports when embedded in IE.
For example:
http://pdfserver/pdfs/pdf.pdf#search=foobar
Adobe Reader then jumps to the page where the search term occurs.
One would need to URL-encode the search terms, of course.
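As a rough sketch of the encoding step, the link could be built like this. The function name pdf_search_link is made up for illustration, and the example simply percent-encodes the whole term with Python's standard library; how a given viewer interprets multi-word terms is not covered here.

```python
from urllib.parse import quote


def pdf_search_link(pdf_url, term):
    """Append a percent-encoded #search=... fragment to a PDF URL.

    pdf_search_link is a hypothetical helper, not part of Solr or Adobe Reader.
    """
    # safe="" makes quote() encode every reserved character, including "/".
    return f"{pdf_url}#search={quote(term, safe='')}"


if __name__ == "__main__":
    print(pdf_search_link("http://pdfserver/pdfs/pdf.pdf", "foo bar"))
```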
Upvotes: 1
Reputation: 4770
Apache Tika can transform PDF files into structured data for you to feed into the Solr server.
My approach to your problem would be to index each PDF page by page, with extra fields linking to the chapter, text title (or absolute path, or both) and page number. Using this data, you can then open the relevant document at the relevant page.
Read more about Tika here: http://tika.apache.org/
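A minimal sketch of the per-page idea, assuming the text of each page has already been extracted (for instance with Tika) into a list of strings. The field names (id, path, page, text) and both helper functions are hypothetical; the actual extraction and the upload to Solr are left out.

```python
def page_documents(pdf_path, page_texts):
    """Build one index document per PDF page.

    page_texts is assumed to hold the extracted text of each page, in order.
    The field names are illustrative and would have to match your Solr schema.
    """
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{pdf_path}#page={page_number}",  # unique per page
            "path": pdf_path,
            "page": page_number,
            "text": text,
        })
    return docs


def link_for_hit(base_url, doc):
    """Turn a matching per-page document into a link that opens at that page."""
    return f"{base_url}{doc['path']}#page={doc['page']}"


if __name__ == "__main__":
    docs = page_documents("/pdfs/pdf.pdf", ["intro", "more text", "foobar here"])
    # Stand-in for a real search: pick the pages whose text contains the term.
    hits = [d for d in docs if "foobar" in d["text"]]
    for hit in hits:
        print(link_for_hit("http://pdfserver", hit))
```

Since each page is its own document, the page number comes back with every search hit and the results page only has to append it to the link.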
Upvotes: 0