Reputation: 31
I have run into a problem while trying to implement fulltext search. To me it seems like math/statistics more than anything. The data pulled from the database is book titles, so the scores returned by the query can have very close values (example: 9.98, 9.97, 9.78, which are all very relevant results) or a wide spread (example: 9.99, 8.2, 2.1, where the first two are relevant and the third is noise). I can't figure out how to manipulate the query result to remove the irrelevant rows. Standard deviation doesn't work, because it filters out good results in my first example, and the various normalization methods I've tried either omit relevant results or include irrelevant ones. Any thoughts or ideas, please.
Thanks. Victor
Upvotes: 3
Views: 195
Reputation: 4210
I was just working on a problem much like this, but with time-based data rather than fulltext. I found the 68-95-99.7 rule, which among other things points out that in a true bell curve about 95% of the results fall within two standard deviations of the mean. I took this knowledge and decided to throw out 5% of the results as outliers. You could do something similar -- omit the 5% of fulltext results with the lowest relevancy scores.
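A minimal sketch of that percentile trim in Python, assuming the results arrive as (title, score) pairs; the function name, the floor-based rounding, and the "keep at least one row" guard are my own illustrative choices, not from the question:

```python
import math

def trim_lowest_fraction(results, fraction=0.05):
    """Sort (title, score) pairs by score descending and drop the lowest `fraction` of rows."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    # Keep at least one row so a tiny result set is never emptied out.
    keep = max(1, math.floor(len(ranked) * (1 - fraction)))
    return ranked[:keep]

# With only 3 rows, floor(3 * 0.95) = 2, so the lowest-scoring row is dropped:
results = [("Book A", 9.99), ("Book B", 8.2), ("Book C", 2.1)]
print(trim_lowest_fraction(results))  # [('Book A', 9.99), ('Book B', 8.2)]
```

Note that with very small result sets the 5% cut rounds harshly, so you may want to tune the rounding rule for your typical result counts.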
Another option might be to choose a certain threshold relevancy score, or a certain minimum number of results you want to show. Or both -- you could display by whichever criterion yields more results.
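Here is a sketch of that combined rule, again in Python; the threshold value and the minimum count are placeholders you would tune against your own score distribution:

```python
def select_results(results, threshold=5.0, min_results=2):
    """Keep rows scoring at or above `threshold`, but show at least `min_results` rows,
    whichever criterion yields more."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    above = [r for r in ranked if r[1] >= threshold]
    # Fall back to the top-N rows if the threshold filtered too aggressively.
    return above if len(above) >= min_results else ranked[:min_results]

# Tight cluster from the question: all three pass the threshold, so all are kept.
print(select_results([("A", 9.98), ("B", 9.97), ("C", 9.78)]))
# Wide spread: only the two relevant rows pass, so the noise at 2.1 is dropped.
print(select_results([("A", 9.99), ("B", 8.2), ("C", 2.1)]))
```

With these example settings the rule handles both of the score patterns in the question: the tight cluster is shown in full, and the clear outlier in the wide-spread case is filtered out.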
Upvotes: 1