Alex R

Reputation: 11881

How to get Lucene scoring to account for words not specified in search terms?

There is probably a name for what I'm asking and it has something to do with Bayesian statistics.

I have a database of street addresses and I'm using Lucene to match user-entered addresses (if you need an analogy, pretend I work for Google Maps).

Given that both "West North Avenue" and "West North Shore Avenue" are valid street names, how can I get Lucene to score "2000 West North Avenue" higher than "1000 West North Shore Avenue" when searching for "1000^0.001 West North Avenue"?

The 1000^0.001 boost means the number should only be used to break a tie; matching the street name is more important than matching the right number on the wrong street.

Unfortunately in this example, the 1000^0.001 causes the wrong match (North Shore) to get ahead of the correct one.

What scoring algorithm would enable Lucene to adjust the score downwards for failure to specify an indexed term in the search, with rare terms weighing more than common terms?

Upvotes: 1

Views: 34

Answers (1)

Persimmonium

Reputation: 15771

I would solve this by carefully tokenizing street names. For instance, you could do this:

  1. Extract the number and the street name into two different fields, street_nb and street_nm, and index them separately.
  2. Now use two clauses in your query: the one targeting street_nm is MUST, and the one targeting street_nb is SHOULD. That way the street name has to match on its own, and if the number matches too, even better.
  3. You can do other things besides this, like using phrases to force a perfect match on the street name. Play around with the variants until you get good results.
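
To make the idea concrete, here is a minimal sketch in plain Python rather than Lucene itself. It assumes the street name is the required (MUST) clause and the number is the optional (SHOULD) clause, and it uses the "perfect match on the street name" variant from step 3. The helper names `split_address`, `score` and `rank`, the regular expression, and the 0.001 tie-break weight are illustrative assumptions, not Lucene APIs; in Lucene the same shape would be built from a boolean query with MUST and SHOULD clauses.

```python
import re


def split_address(address):
    """Split an address into (street_nb, street_nm).

    Hypothetical helper: assumes the house number, if present, is a
    leading run of digits. The name is lowercased and whitespace-normalized.
    """
    m = re.match(r"\s*(\d+)\s+(.+?)\s*$", address)
    if m:
        number, name = m.group(1), m.group(2)
    else:
        number, name = "", address
    return number, " ".join(name.lower().split())


def score(query, candidate):
    """Toy stand-in for a MUST clause on street_nm plus a SHOULD clause on street_nb."""
    q_nb, q_nm = split_address(query)
    c_nb, c_nm = split_address(candidate)
    if q_nm != c_nm:
        # MUST clause failed: the whole street name has to match exactly.
        return 0.0
    # SHOULD clause: a matching number only adds a small tie-break bonus.
    return 1.0 + (0.001 if q_nb and q_nb == c_nb else 0.0)


def rank(query, candidates):
    """Return the matching candidates, best score first."""
    scored = [(score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: -pair[0])
    return [c for s, c in scored if s > 0]


addresses = [
    "1000 West North Shore Avenue",
    "2000 West North Avenue",
    "1000 West North Avenue",
]
print(rank("1000 West North Avenue", addresses))
# → ['1000 West North Avenue', '2000 West North Avenue']
```

Because the name comparison is exact, "West North Shore Avenue" never matches a query for "West North Avenue", no matter how well the number matches. If you need partial name matches as well, you would have to relax the MUST clause and rely on something else (for example field length normalization) to rank the shorter name first.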

Upvotes: 1
