Reputation: 575
I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display each search result with a short snippet of the paragraph where the search term was found and a link that opens the document at the right page.
So what I need is the page number and a short text snippet of every search result.
I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result.
I found the question "Indexing PDF with page numbers with Solr", but it wasn't really helpful.
Upvotes: 3
Views: 2329
Reputation: 571
I have not tried it myself. Approach,
A far better approach compared to splitting the PDFs and indexing them as separate Solr docs.
If you find a flaw in this design, respond to my thread. I will attempt to resolve it.
Upvotes: 0
Reputation: 1
I also tried to get the results with page numbers but could not do it. I used Apache PDFBox to split all the PDFs in a directory and send the resulting files to the Solr server.
Upvotes: 0
Reputation: 575
I'm now splitting the PDF and sending each page separately to Solr.
Every page is its own document with an id <id_of_document>_<page_number>
and an additional field doc_id, which contains only the <id_of_document>,
for grouping the results.
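
A minimal sketch of this scheme in Python, in case it helps someone. It only builds the per-page documents and the query parameters; extracting the page text and talking to Solr are left out. The field names `text` and `page`, the helper names, and the `viewer.html` path are my assumptions for illustration, not something from my actual setup. The `group.*` and `hl.*` parameters are Solr's standard result grouping and highlighting parameters, and `#page=N` is the fragment the pdf.js viewer accepts for jumping to a page.

```python
from urllib.parse import quote


def build_page_docs(doc_id, page_texts):
    """Return one Solr document (as a dict) per page of a PDF."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{doc_id}_{page_number}",  # <id_of_document>_<page_number>
            "doc_id": doc_id,                 # shared by all pages, used for grouping
            "page": page_number,              # assumed extra field, optional
            "text": text,                     # assumed name of the searchable field
        })
    return docs


def build_search_params(term):
    """Query parameters that group hits per PDF and request a snippet."""
    return {
        "q": term,
        "df": "text",            # default search field (assumed field name)
        "group": "true",
        "group.field": "doc_id",
        "hl": "true",
        "hl.fl": "text",
        "hl.snippets": 1,
        "hl.fragsize": 150,
        "wt": "json",
    }


def page_from_id(solr_id):
    """Recover the page number from an id like <id_of_document>_<page_number>."""
    # Split on the last underscore only, since the document id may contain one.
    return int(solr_id.rsplit("_", 1)[1])


def viewer_link(pdf_url, page_number):
    """Link that opens the PDF in the pdf.js viewer at the given page."""
    # "viewer.html" is a placeholder for wherever the viewer is hosted.
    return f"viewer.html?file={quote(pdf_url, safe='')}#page={page_number}"
```

The dicts from `build_page_docs` can be posted to Solr's update handler with whatever client you use, and the highlighting section of the response then supplies the snippet for each page id. Splitting on the last underscore keeps `page_from_id` working even when the document id itself contains underscores.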
Upvotes: 2