Reputation: 2250
We have to index books. Each book is split into chapters, and each chapter is split into pages (pages reflect the original page breaks of the printed version).
Results should be shown grouped by book, then by chapter within the same book, then by page within the same chapter.
As far as I know, we have 2 options:
1. Index pages as SOLR documents. In this way we could theoretically retrieve chapters (and books?) using grouping, but:
   - we will miss matches that span two contiguous pages (page breaks exist only for typographical reasons, so a concept can be split across pages, as in printed books)
   - I don't know if it is possible in SOLR to group results on two different levels (books and chapters)
2. Index chapters as SOLR documents. In this case we will have the right matches, but how do we obtain, for example, the list of pages containing a match (or part of it)? We need pages because the client can only display pages.
Upvotes: 1
Views: 286
Reputation: 636
I have always gone with the option to make each page a Solr document.
When I parse the digital version of a book, I capture which page numbers belong to a given chapter, work out how many pages long the chapter is, and assign an id of some kind to each chapter. Since each page becomes a Solr document, that information has to be repeated in each page's manifest, which also includes overall book metadata like title, creator, publication date, etc. None of that is done in Solr itself, but with shell scripts as prep before Solr indexing. Sometimes I store all that metadata in a database, sometimes in a file on disk. Finally, I produce a manifest per page in Solr add/update XML so it's easy for Solr to ingest.
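To make the per-page manifest concrete, here is a minimal sketch in Python (I actually do this with shell scripts, so this is only an illustration). The field names (`book_id`, `chapter_id`, `chapter_page_count`, `page_number`, `text`) and the shape of the `book` and `chapter` dicts are assumptions for the example; they would need to match whatever your schema defines.

```python
import xml.etree.ElementTree as ET


def page_to_add_xml(book, chapter, page_number, text):
    """Build a Solr add/update XML message for a single page.

    All field names are hypothetical; adapt them to your schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        # One Solr document per page, so the id is page-specific.
        "id": f"{book['id']}-p{page_number}",
        # Book and chapter metadata is repeated on every page.
        "book_id": book["id"],
        "book_title": book["title"],
        "chapter_id": chapter["id"],
        "chapter_first_page": chapter["first_page"],
        "chapter_page_count": chapter["last_page"] - chapter["first_page"] + 1,
        "page_number": page_number,
        "text": text,
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    book = {"id": "b1", "title": "Example Book"}
    chapter = {"id": "b1-c2", "first_page": 10, "last_page": 14}
    print(page_to_add_xml(book, chapter, 12, "Text of page 12."))
```

The resulting XML can be posted to the collection's update handler; repeating the chapter id on every page is what makes collapsing by chapter possible at query time.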
When I query Solr, I use fq={!collapse field=<chapter-id-field> nullPolicy=expand} so that in search results, only the most relevant page of a chapter comes back to be presented to the user. The nullPolicy=expand attribute allows search results that aren't book chapters to come back normally, without collapsing, which is important when I put together a search index consisting of a variety of sources.
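As a rough sketch of how that filter might be assembled on the client side, the Python below builds a URL-encoded query string. The field name `chapter_id`, the `rows` default, and the function name are assumptions for illustration; only the `{!collapse ...}` filter itself comes from my setup.

```python
from urllib.parse import urlencode


def build_collapse_query(user_query, chapter_field="chapter_id", rows=10):
    """Return a URL-encoded Solr query string that collapses pages by chapter.

    chapter_field is a hypothetical field name holding the chapter id.
    """
    params = {
        "q": user_query,
        # Keep only the top-scoring page per chapter; documents with no
        # chapter id are returned individually (nullPolicy=expand).
        "fq": f"{{!collapse field={chapter_field} nullPolicy=expand}}",
        "rows": rows,
        "wt": "json",
    }
    return urlencode(params)


if __name__ == "__main__":
    # Append the result to your collection's /select URL.
    print(build_collapse_query("white whale"))
```

Using `urlencode` takes care of escaping the braces, the exclamation mark, and the spaces in the local-params syntax, which are easy to get wrong when building the URL by hand.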
From the user's point of view, they get a "chapter" hit, and that chapter is only going to be represented once across their search results. In the results UI, I make it clear that "this chapter is x pages long, the best match for your search was found on page y". The UI includes a document viewer, so I give the user the choice to jump straight to page y (the most relevant) in the document viewer, or start reading the chapter at its beginning. And, of course, I could give them the option to read the entire book from its beginning, too.
As for worrying about concepts split across pages, I don't. I find that most people search for single words or small phrases. I'm sure there are cases where a search phrase is split across two Solr documents, but we are talking about books here: large bodies of content in which key terms and phrases tend to be repeated.
Upvotes: 3