Gesh

Reputation: 575

Get page numbers of search results for a PDF in Solr

I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.

So what I need is the page number and a short text snippet of every search result.

I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.

I found the question "Indexing PDF with page numbers with Solr", but it wasn't really helpful.

Upvotes: 3

Views: 2329

Answers (4)

aswath86

Reputation: 571

I have not tried this myself, but here is an approach:

  1. Write a custom Solr connector that integrates with the Apache Tika parser to index the PDFs.
  2. Create multiple attributes (fields) in Solr, such as page1, page2, page3, …, pageN. Alternatively, use dynamic attributes in Solr.
  3. In the custom connector, read each PDF page by page and index every page into its respective page attribute or dynamic attribute.
  4. Enable search on all the “page” attributes.
  5. When a user searches, use the “highlighter/Summary/Teaser” component to retrieve only the “page” attributes that have hits.
  6. The “page” attributes that have a hit for a given record (as reported by the highlighter/Summary/Teaser) are the pages that contain the searched phrase.
  7. Link to the PDF with the “#PageNumber” of the hit and open that page on click.
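The steps above could be sketched roughly as follows. This is an untested illustration, not a finished connector: the field naming scheme `page_<N>` (backed by an assumed dynamic field `page_*`), the helper names, and the `#page=N` link fragment are all assumptions of mine. It only shows how a per-page field layout lets you read page numbers back out of the `highlighting` section of a Solr response.

```python
import re

# Assumed naming scheme: one field per page, matched by a dynamic field "page_*".
PAGE_FIELD = re.compile(r"^page_(\d+)$")


def build_document(doc_id, page_texts):
    """Build one Solr document with a separate field for each page's text."""
    doc = {"id": doc_id}
    for number, text in enumerate(page_texts, start=1):
        doc[f"page_{number}"] = text
    return doc


def pages_with_hits(highlighting, doc_id):
    """Return sorted (page_number, first_snippet) pairs for one document.

    `highlighting` is the "highlighting" section of a Solr JSON response,
    which maps document id -> field name -> list of snippets.
    """
    hits = []
    for field, snippets in highlighting.get(doc_id, {}).items():
        match = PAGE_FIELD.match(field)
        if match and snippets:
            hits.append((int(match.group(1)), snippets[0]))
    return sorted(hits)


def page_link(pdf_url, page_number):
    """Build a viewer link; the "#page=N" fragment is an assumption."""
    return f"{pdf_url}#page={page_number}"
```

On the query side this would presumably be combined with `hl=true` and `hl.fl=page_*`, so that only page fields containing a match appear in the highlighting section.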

I think this is a far better approach than splitting the PDFs and indexing the pages as separate Solr docs.

If you find a flaw in this design, respond to my thread. I will attempt to resolve it.

Upvotes: 0

Mayank Vij

Reputation: 1

I also tried to get the results with page numbers but could not do it. I used Apache PDFBox to split all the PDFs in a directory and sent the resulting files to the Solr server.

Upvotes: 0

Gesh

Reputation: 575

I'm now splitting the PDF and sending each page separately to Solr. Every page is its own document with the id <id_of_document>_<page_number> and an additional field doc_id, which contains only the <id_of_document> and is used for grouping the results.
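A minimal sketch of that layout, for illustration only: the `content` field name, the helper names, and the chosen parameter values are assumptions, and extracting the page texts from the PDF is left out. The query string uses Solr's result grouping (`group`, `group.field`) and highlighting (`hl`, `hl.fl`) parameters so that hits are grouped per original document and each hit carries a snippet.

```python
from urllib.parse import urlencode


def build_page_documents(doc_id, page_texts):
    """Create one Solr document per page: id is <doc_id>_<page_number>."""
    return [
        {"id": f"{doc_id}_{number}", "doc_id": doc_id, "content": text}
        for number, text in enumerate(page_texts, start=1)
    ]


def page_number_from_id(page_doc_id):
    """Recover the page number from the part after the last underscore."""
    return int(page_doc_id.rsplit("_", 1)[1])


def build_query_string(term):
    """Build select parameters that group pages by their source document."""
    params = {
        "q": term,
        "df": "content",          # assumed name of the page text field
        "group": "true",
        "group.field": "doc_id",  # one group per original PDF
        "group.limit": 5,         # at most 5 matching pages per PDF
        "hl": "true",
        "hl.fl": "content",
        "hl.fragsize": 150,       # approximate snippet length in characters
        "wt": "json",
    }
    return urlencode(params)
```

Splitting on the last underscore means the document id itself may contain underscores, for example `page_number_from_id("annual_report_12")` yields page 12.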

Upvotes: 2

Jayendra

Reputation: 52799

There is a JIRA issue, SOLR-380, with a patch that you can check out.

Upvotes: 0
