How should I index a book in Solr?

Question

I have a PDF file of a book I want to index, but I want to be able to tell which chapter (and even which sentence) a word came from. How can I do that in Solr? I'm not sure of the correct way to go about this from the docs. How would I do it if it were a text file of the book instead of a PDF?

Alexandre Rafalovitch · Accepted Answer

You cannot do that easily with PDFs. If you have access to ePub versions, your job would be a lot simpler.

A PDF (unless it has an accessibility layer) does not preserve the text flow, so you will have real problems determining the text itself, never mind the chapters, etc.

The problem is not with Solr (yet), but with basic content extraction from PDF. Look at Apache Tika and see how much information it can extract. If that's not enough, you need to use something other than PDF.
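
For the plain-text case, here is a minimal sketch of one way to get chapter and sentence positions into Solr: split the book yourself and index one document per sentence. It assumes chapter headings are lines that start with the word "Chapter", uses a deliberately naive sentence splitter, and the field names (`chapter_i`, `chapter_title_s`, `sentence_i`, `text_t`) are illustrative ones that rely on dynamic-field suffixes like those in Solr's default configset. Adjust all of these to your own book and schema.

```python
import json
import re

# Assumption: a chapter heading is a line such as "Chapter 1" or "CHAPTER II".
CHAPTER_RE = re.compile(r"^[ \t]*chapter[ \t]+\w+.*$", re.IGNORECASE | re.MULTILINE)
# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# This will mis-split abbreviations such as "Mr. Smith".
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def book_to_docs(text, book_id="book"):
    """Return one dict per sentence, tagged with chapter and sentence numbers."""
    docs = []
    headings = list(CHAPTER_RE.finditer(text))
    for chap_no, match in enumerate(headings, start=1):
        # A chapter body runs from the end of its heading to the next heading.
        body_start = match.end()
        body_end = headings[chap_no].start() if chap_no < len(headings) else len(text)
        # Collapse line breaks and repeated spaces before splitting.
        body = " ".join(text[body_start:body_end].split())
        sentences = [s for s in SENTENCE_RE.split(body) if s]
        for sent_no, sentence in enumerate(sentences, start=1):
            docs.append({
                "id": f"{book_id}-c{chap_no}-s{sent_no}",
                "chapter_i": chap_no,
                "chapter_title_s": match.group(0).strip(),
                "sentence_i": sent_no,
                "text_t": sentence,
            })
    return docs


if __name__ == "__main__":
    sample = (
        "Chapter 1\n"
        "It was a dark night. The rain fell!\n"
        "\n"
        "Chapter 2\n"
        "Morning came. Was it over?\n"
    )
    # Print the documents as a JSON array, ready to be sent to Solr.
    print(json.dumps(book_to_docs(sample, book_id="demo"), indent=2))
```

The resulting JSON array can be posted to your collection's `/update` handler. A search on `text_t` then returns hits that each carry their own `chapter_i` and `sentence_i`, which is what tells you where the word came from. Text located before the first heading is skipped by this sketch.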
