user3099160

Reputation: 13

Lucene Hierarchical Taxonomy Search

I have a set of documents annotated with hierarchical taxonomy tags, e.g.:

[
{
    "id": 1,
    "title": "a funny book",
    "authors": ["Jean Bon", "Alex Terieur"],
    "book_category": "/novel/comedy/new"
},
{
    "id": 2,
    "title": "a dramatic book",
    "authors": ["Alex Terieur"],
    "book_category": "/novel/drama"
},
{
    "id": 3,
    "title": "A hilarious book",
    "authors": ["Marc Assin", "Harry Covert"],
    "book_category": "/novel/comedy"
},
{
    "id": 4,
    "title": "A sad story",
    "authors": ["Gerard Menvusa", "Alex Terieur"],
    "book_category": "/novel/drama"
},
{
    "id": 5,
    "title": "A very sad story",
    "authors": ["Gerard Menvusa", "Alain Terieur"],
    "book_category": "/novel"
}]

I need to search books by "book_category". The search must return books whose category matches the query category exactly or partially (within a defined depth threshold), and it must score them differently according to the degree of match.

E.g.: the query "book_category=/novel/comedy" with "depth_threshold=1" must return books with book_category=/novel/comedy (score = 100%), as well as /novel and /novel/comedy/new (score < 100%).

I tried using a TopScoreDocCollector in the search, but it only returns the books whose book_category contains the query category, and it gives them all the same score.

How can I build a search that also returns the more general categories and gives the results different match scores?

P.S.: I don't need a faceted search.

Thanks

Upvotes: 1

Views: 547

Answers (2)

user3099160

Reputation: 13

This could be a solution, but I have more than one hierarchical field to query and I want to use the CategoryPath indexed in the taxonomy. I'm using a DrillDownQuery:

DrillDownQuery luceneQuery = new DrillDownQuery(searchParams.indexingParams);
luceneQuery.add(new CategoryPath("book_category/novel/comedy", '/'));
luceneQuery.add(new CategoryPath("subject/sub1/sub2", '/'));

This way the search returns the books that match the two category paths and their descendants. To retrieve the ancestors as well, I can start the drill-down from an ancestor of the requested CategoryPath (retrieved from the taxonomy).

The problem is that all results get the same score. I want to override the similarity/score function to calculate a score based on CategoryPath length, comparing the query CategoryPath with each returned document's CategoryPath (book_category).
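As a rough illustration of picking the ancestor to start from, here is a minimal plain-Java sketch. It assumes the '/'-separated string form with a leading slash used in the question (e.g. "/novel/comedy/new") and works on the string itself rather than on the taxonomy index; the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class CategoryAncestors {

    /**
     * Returns at most maxLevels ancestors of a '/'-separated path,
     * nearest ancestor first. Assumes the path starts with '/'.
     */
    public static List<String> ancestors(String path, int maxLevels) {
        List<String> result = new ArrayList<>();
        String current = path;
        for (int i = 0; i < maxLevels; i++) {
            int lastSlash = current.lastIndexOf('/');
            if (lastSlash <= 0) {
                break; // already at a top-level category, nothing above it
            }
            current = current.substring(0, lastSlash);
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // depth threshold 1: only the direct parent
        System.out.println(ancestors("/novel/comedy/new", 1));
        // a larger threshold stops at the top-level category
        System.out.println(ancestors("/novel/comedy/new", 5));
    }
}
```

The last element of the returned list would be the most general path within the threshold, i.e. the candidate starting point for the drill-down.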

E.g.:

// pseudocode: compareTo stands here for the number of levels between the two paths
if (queryCategoryPath.compareTo(bookCategoryPath) == 0) {
    document.score = 1
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 1) {
    document.score = 0.9
} else if (queryCategoryPath.compareTo(bookCategoryPath) == 2) {
    document.score = 0.8
} // and so on
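To make the intended scoring concrete, here is a minimal plain-Java sketch of a level-distance score. It works on '/'-separated path strings rather than on CategoryPath objects, and the class name, method names and the penaltyPerLevel parameter are all assumptions made for this example. It only computes the number; plugging it into Lucene's scoring is not shown.

```java
public class CategoryPathScore {

    /** Number of non-empty segments in a '/'-separated path. */
    public static int depth(String path) {
        int count = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** True if ancestor equals path, or is a prefix of it ending on a segment boundary. */
    public static boolean isSameOrAncestor(String ancestor, String path) {
        return path.equals(ancestor) || path.startsWith(ancestor + "/");
    }

    /**
     * 1.0 for an exact match, reduced by penaltyPerLevel for every level
     * between the two paths; 0.0 if the paths are not on the same branch
     * or are further apart than depthThreshold.
     */
    public static double score(String query, String doc,
                               int depthThreshold, double penaltyPerLevel) {
        if (!isSameOrAncestor(query, doc) && !isSameOrAncestor(doc, query)) {
            return 0.0; // different branches of the hierarchy
        }
        int distance = Math.abs(depth(query) - depth(doc));
        if (distance > depthThreshold) {
            return 0.0; // related, but outside the allowed depth
        }
        return Math.max(0.0, 1.0 - penaltyPerLevel * distance);
    }
}
```

With a penalty of 0.1 per level this reproduces the 1 / 0.9 / 0.8 steps above, and a document in "/novel/drama" scores 0 against the query "/novel/comedy".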

Upvotes: 0

knutwalker

Reputation: 5974

There is no built-in query that supports this requirement, but you can use a DisjunctionMaxQuery with multiple ConstantScoreQuery clauses. The exact category and the more general category can be matched with simple TermQuery instances. For the sub-categories, you can use a MultiTermQuery such as RegexpQuery to match all sub-categories if you don't know them up front. For example:

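import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.TermQuery;

// the exact category
Query directQuery = new TermQuery(new Term("book_category", "/novel/comedy"));
// regex that matches exactly one level below the exact category
Query narrowerQuery = new RegexpQuery(new Term("book_category", "/novel/comedy/[^/]+"));
// the more general category
Query broaderQuery = new TermQuery(new Term("book_category", "/novel"));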

directQuery = new ConstantScoreQuery(directQuery);
narrowerQuery = new ConstantScoreQuery(narrowerQuery);
broaderQuery = new ConstantScoreQuery(broaderQuery);

// 100% for the exact category
directQuery.setBoost(1.0F);
// 80% for the more specific category
narrowerQuery.setBoost(0.8F);
// 50% for the more general category
broaderQuery.setBoost(0.5F);

DisjunctionMaxQuery query = new DisjunctionMaxQuery(0.0F);

query.add(directQuery);
query.add(narrowerQuery);
query.add(broaderQuery);

This would give a result like:

id=3 title=a hilarious book book_category=/novel/comedy score=1.000000
id=1 title=a funny book book_category=/novel/comedy/new score=0.800000
id=5 title=A very sad story book_category=/novel score=0.500000

For a complete test case, see this gist: https://gist.github.com/knutwalker/7959819
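If you need a depth threshold larger than one, the same idea can be repeated per level. The following is a minimal plain-Java sketch that only generates the regex patterns and ancestor terms together with a boost for each level. The class name, the method names and the firstBoost/decay parameters are assumptions made for this example, and it assumes the category contains no regex metacharacters. Each pattern would then go into a RegexpQuery and each term into a TermQuery, wrapped in a ConstantScoreQuery as shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CategoryClauses {

    /** Regex pattern -> boost, one entry per level below the category. */
    public static Map<String, Float> narrowerPatterns(
            String category, int depthThreshold, float firstBoost, float decay) {
        Map<String, Float> patterns = new LinkedHashMap<>();
        StringBuilder pattern = new StringBuilder(category);
        float boost = firstBoost;
        for (int level = 1; level <= depthThreshold; level++) {
            pattern.append("/[^/]+"); // exactly one more path segment
            patterns.put(pattern.toString(), boost);
            boost *= decay; // deeper levels get a lower boost
        }
        return patterns;
    }

    /** Ancestor path -> boost, one entry per level above the category. */
    public static Map<String, Float> broaderTerms(
            String category, int depthThreshold, float firstBoost, float decay) {
        Map<String, Float> terms = new LinkedHashMap<>();
        String current = category;
        float boost = firstBoost;
        for (int level = 1; level <= depthThreshold; level++) {
            int lastSlash = current.lastIndexOf('/');
            if (lastSlash <= 0) {
                break; // no more ancestors above a top-level category
            }
            current = current.substring(0, lastSlash);
            terms.put(current, boost);
            boost *= decay; // more distant ancestors get a lower boost
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(narrowerPatterns("/novel/comedy", 2, 0.8f, 0.5f));
        System.out.println(broaderTerms("/novel/comedy", 2, 0.5f, 0.5f));
    }
}
```

Using one pattern per level, rather than a single pattern covering all levels, keeps a separate boost for each distance from the query category.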

Upvotes: 1
