Reputation: 11881
There is probably a name for what I'm asking and it has something to do with Bayesian statistics.
I have a database of street addresses and I'm using Lucene to match user-entered addresses (if you need an analogy, pretend I work for Google Maps).
Given that both "West North Avenue" and "West North Shore Avenue" are valid street names, how can I get Lucene to score "2000 West North Avenue" higher than "1000 West North Shore Avenue" when searching for "1000^0.001 West North Avenue"?
The 1000^0.001 boost means that the number should only be used to break a tie; otherwise, matching the street name is more important than matching the right number on the wrong street.
Unfortunately, in this example the 1000^0.001 term still causes the wrong match (North Shore) to rank ahead of the correct one.
What scoring algorithm would enable Lucene to lower a document's score when the search omits a term that the document contains (here, "Shore"), with rare terms weighing more than common terms?
Upvotes: 1
Views: 34
Reputation: 15771
I would solve this by carefully tokenizing street names. For instance, you could do this:
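One possible reading of this advice, sketched in Python without any Lucene dependency: treat the house number and the whole normalized street name as two separate tokens, so that "West North Avenue" and "West North Shore Avenue" no longer share any street token. The function names, the regular expression, and the two weights below are assumptions made for illustration, not Lucene APIs.

```python
import re

# Hypothetical weights: the street name dominates the score,
# and the house number only breaks ties (mirroring the ^0.001 boost).
STREET_WEIGHT = 1.0
NUMBER_WEIGHT = 0.001


def tokenize(address):
    """Split an address into (house_number, street_token).

    The street name is lowercased and joined into ONE token, so a
    partial overlap between two street names earns no credit.
    """
    match = re.match(r"\s*(\d+)\s+(.+?)\s*$", address)
    if match is None:
        # No leading house number: treat the whole string as the street name.
        return None, "_".join(address.lower().split())
    number, street = match.groups()
    return number, "_".join(street.lower().split())


def score(query, candidate):
    """Score a candidate address against a query using the two tokens."""
    q_number, q_street = tokenize(query)
    c_number, c_street = tokenize(candidate)
    total = 0.0
    if q_street == c_street:
        total += STREET_WEIGHT
    if q_number is not None and q_number == c_number:
        total += NUMBER_WEIGHT
    return total


def rank(query, candidates):
    """Return the candidates ordered from best to worst match."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)


if __name__ == "__main__":
    addresses = ["1000 West North Shore Avenue", "2000 West North Avenue"]
    for address in rank("1000 West North Avenue", addresses):
        print(address, score("1000 West North Avenue", address))
```

With this tokenization, "2000 West North Avenue" ranks above "1000 West North Shore Avenue" for the query "1000 West North Avenue", because only the former matches the street token. In Lucene itself, a comparable effect could presumably be achieved by indexing the normalized street name in an untokenized field next to a separate house-number field; that mapping is an assumption and would need to be checked against the analyzer and field types actually in use.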
Upvotes: 1