Simon Fredsted

Reputation: 976

Solr PDF search: "Go to page" function

We're building a PDF search engine with Solr and Lucene where users can search for text in PDFs. The database only contains PDFs.

On the search results page ("/browse") we want to append #page=X to the link to the PDF file, where X is the page the text was found on. (Adobe Acrobat automatically scrolls to the given page when it is specified in the URL fragment.)

For example, if I search for foobar and there's a PDF document where foobar is on page 5, the link should be http://pdfserver/pdfs/pdf.pdf#page=5 (note the fragment at the end).

  1. Is this possible?
  2. How would we get this page number?

Upvotes: 4

Views: 1024

Answers (2)

Simon Fredsted

Reputation: 976

One easy-to-implement solution I found is the #search parameter, which Adobe Reader supports when embedded in IE.

For example:

http://pdfserver/pdfs/pdf.pdf#search=foobar

Adobe Reader then runs the search and jumps to the page with the match.

The search terms need to be URL-encoded, of course.
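As a rough sketch of the encoding step, something like the following could build the link. The function name `search_link` is made up for illustration, and the base URL is just the example from the question.

```python
from urllib.parse import quote


def search_link(pdf_url, terms):
    """Return pdf_url with a percent-encoded #search= fragment appended."""
    # safe="" makes quote() encode every reserved character,
    # so spaces, "&" and "#" in the user's query cannot break the fragment.
    return f"{pdf_url}#search={quote(terms, safe='')}"


if __name__ == "__main__":
    print(search_link("http://pdfserver/pdfs/pdf.pdf", "foo bar"))
```

Whether the viewer honours the fragment still depends on the PDF plugin the user has installed, so treat this as a convenience rather than a guarantee.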

Upvotes: 1

omu_negru

Reputation: 4770

Apache Tika can transform PDF files into structured data that you can feed into the Solr server.

My approach to your problem would be to index each PDF page as a separate document, with extra fields for the chapter, the document title (or absolute path, or both), and the page number. Using this data, you can then open the relevant document at the relevant page.
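To make the per-page idea concrete, here is a minimal sketch of what the indexed documents and the resulting link could look like. It assumes the page texts have already been extracted (for example with Tika). The field names `id`, `url`, `page` and `text` are hypothetical and would need to match your own Solr schema.

```python
import json


def page_documents(pdf_url, page_texts):
    """Build one Solr-style document per PDF page (hypothetical field names)."""
    docs = []
    for number, text in enumerate(page_texts, start=1):  # PDF pages are 1-based
        docs.append({
            "id": f"{pdf_url}#page={number}",  # unique key per page
            "url": pdf_url,                    # link to the whole PDF
            "page": number,                    # page the text appears on
            "text": text,                      # searchable page content
        })
    return docs


def link_for_hit(hit):
    """Turn a search hit (one page document) into a link that opens at that page."""
    return f"{hit['url']}#page={hit['page']}"


if __name__ == "__main__":
    pages = ["introduction", "more text", "foobar appears here"]
    docs = page_documents("http://pdfserver/pdfs/pdf.pdf", pages)
    print(json.dumps(docs, indent=2))  # JSON that could be posted to Solr
    print(link_for_hit(docs[2]))
```

Because every hit is a single page, the page number comes back with the search result and no extra lookup is needed when rendering the results page.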

Read more about Tika here: http://tika.apache.org/

Upvotes: 0
