Search names inside a long text using Lucene

Question

I have a Lucene index that contains names like:

douglas adams
adams sandlers
adams

etc.

Searching for a single name is fairly easy. But I have some messages that I need to check to see whether they contain any of these names, and they are fairly long, like:

Radio producer Dirk Maggs had consulted with Adams, first in 1993, and later in 1997 and 2000 about creating a third radio series, based on the third novel in the Hitchhiker's series.[21] They also discussed the possibilities of radio adaptations of the final two novels in the five-book "trilogy". As with the movie, this project was only realised after Adams's death. The third series, The Tertiary Phase, was broadcast on BBC Radio 4 in September 2004 and was subsequently released on audio CD. With the aid of a recording of his reading of Life, the Universe and Everything and editing, Adams can be heard playing the part of Agrajag posthumously. So Long, and Thanks for All the Fish and Mostly Harmless made up the fourth and fifth radio series, respectively (on radio they were titled The Quandary Phase and The Quintessential Phase) and these were broadcast in May and June 2005, and also subsequently released on Audio CD. The last episode in the last series (with a new, "more upbeat" ending) concluded with, "The very final episode of The Hitchhiker's Guide to the Galaxy by Douglas Adams is affectionately dedicated to its author.

The problem is that this is the message, and I need to form a query, or a group of queries, from it to find the indexed names it contains.

I tried looking up each term separately, but that produces lots of false positives: it finds every name that contains any of the terms.

For the text above, it should match the "adams" entry and also the "douglas adams" entry, but not "adams sandlers". As you can see, it's like looking the opposite way: I want to search for each entry inside the text, but unfortunately the setup is the opposite.

Does anyone know how to deal with this? I'm not expecting an exact solution, but any idea would be appreciated.

Rushik · Accepted Answer

Here is a fairly simple approach.

1) Index all your names in Lucene (you've already done this).
2) Fire the entire message as a query (field: Radio producer Dirk Maggs .......).
3) Get all matched documents from Lucene and post-process them (you will get douglas adams, adams sandlers, adams as your top docs).
4) During post-processing, go through each matched document and check each of its terms against the terms of your query. If all terms of the document are found in the query, keep the document; otherwise discard it (this is how "adams sandlers" gets discarded). This will be O(n^2) execution.
5) Done.

Step 4 will be a little expensive, and it can be optimized if execution time becomes a problem.
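As a rough sketch of the post-processing in step 4, the check can be done with a hash set of the message's terms, so each candidate name costs only a few set lookups instead of a nested scan. This is plain Java with no Lucene calls: the class name NameFilter, both method names, and the simple lowercase-and-split tokenizer are assumptions made for illustration. In practice the candidate list would come from your Lucene search results, and the tokenization should match whatever analyzer the index uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for step 4; not part of Lucene.
class NameFilter {

    // Lowercase the text and split on anything that is not a letter or digit.
    // Assumption: a stand-in for the analyzer actually used by the index.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keep only the candidate names whose terms ALL occur in the message.
    static List<String> keepFullyContained(List<String> candidates, String message) {
        Set<String> messageTerms = new HashSet<>(tokenize(message));
        List<String> kept = new ArrayList<>();
        for (String name : candidates) {
            List<String> nameTerms = tokenize(name);
            if (!nameTerms.isEmpty() && messageTerms.containsAll(nameTerms)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Candidates as they might come back from the Lucene query in step 3.
        List<String> candidates = Arrays.asList("douglas adams", "adams sandlers", "adams");
        String message = "The very final episode of The Hitchhiker's Guide to the Galaxy "
                + "by Douglas Adams is affectionately dedicated to its author.";
        System.out.println(keepFullyContained(candidates, message));
    }
}
```

Note that this only checks that every term of a name occurs somewhere in the message. It does not check that the terms are adjacent or in order, so "douglas adams" would also be kept for a message that mentions "Douglas" and "Adams" in unrelated places.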

I am not very sure how to add custom post-processing logic in Solr, but I am sure it's possible.

You can also create a custom Collector and add this logic there, but if you have a large number of documents, execution will be very slow.
