Reputation: 1582
I'm creating a new search system for my application. The idea is to use query relaxation to get wider results from the database and then calculate their relevance to the user. The problem is the algorithm: I'm considering something like a nearest-neighbor algorithm, but I'm uncertain how to use it.
How can I get the relevance, in %, of a record in the database to the user's search?
I need to do this for the attributes distance and category. In other words, when I query the DB, the distance is multiplied by 2 and the category is relaxed by selecting its parent category.
An example: if the user searches for something that is up to 30km away and the category is 'soccer', I'll get from the DB all the records up to 60km away and in 'ballSports' (in a tree like: sports->fullContact->ballSports->soccer, so I'd get sports like soccer, football, rugby, and so on).
This % also needs to be calculated with the weight of each attribute for the user in mind. If the user considers category more important than distance, this has to be taken into account when calculating the relevance.
A good example of a category tree and a formula to calculate distances can be found here on page 3: http://reference.kfupm.edu.sa/content/d/i/a_distributed_case_based_reasoning_appli_58512.pdf
How can I apply that formula to my attributes? BTW, I'm using MongoDB, so all data is in the document, with no relations to other tables.
Thank you
Upvotes: 2
Views: 324
Reputation: 7038
I'm starting with the assumption that your search results come from a classic relational database and that the table has a flat structure like the following:
| categoryId | latitude | longitude | parentCategoryId |
So, relaxing the category based on its parent could be a simple tree search for the child nodes of the parent of the category the user entered (given that you already have the tree in memory). You could do this with a SQL join on the categories table, but in my experience it's better to leave the algorithmic work to Java - it's easier to test and refactor, and you get a wide variety of algorithms with predictable time/space complexity. SQL, on the other hand, can give you a headache with execution plan costs, which sometimes differ dramatically between database vendors.
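To illustrate the in-memory tree search described above, here is a minimal sketch. It is written in Python for brevity (the same logic ports directly to Java). The `PARENT` map, the function names, and the exact tree contents are assumptions based on the example in the question, not anything from a real schema.

```python
# Hypothetical category tree from the question, stored as child -> parent.
PARENT = {
    "fullContact": "sports",
    "ballSports": "fullContact",
    "soccer": "ballSports",
    "football": "ballSports",
    "rugby": "ballSports",
}


def build_children(parent_of):
    """Invert the child -> parent map into parent -> [children]."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    return children


def relax_category(category, parent_of):
    """Return the parent of `category` plus every descendant of that parent.

    A root category has no parent, so it relaxes to its own subtree.
    """
    parent = parent_of.get(category, category)
    children = build_children(parent_of)
    result, stack = [], [parent]
    while stack:  # iterative depth-first walk of the subtree
        node = stack.pop()
        result.append(node)
        stack.extend(children.get(node, []))
    return sorted(result)


relaxed = relax_category("soccer", PARENT)
print(relaxed)
```

The resulting list could then be passed to the database as the set of acceptable categories, for example in an `IN`-style filter.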
How do you get the relevance, in %, to the user based on distance and a match against multiple categories? What should be shown first - football or rugby - if the user entered soccer?
Well, that's a really good question and I don't know a good answer, but what I'd do is use existing data from Google search in the following way: given that the user entered soccer (a child of the ball sports category)
You could easily pre-calculate the Google search result ranks programmatically, or retrieve them dynamically (I wouldn't do it dynamically unless you plan to change categories very often).
As user I'd be happy with this output, let me know what you think :-)
EDIT: I've read the paper, and it looks like in your case the similarity formula could be simplified to calculating the similarity between two words.
One way to do this is to get the Google result count for the two category names searched together ('soccer rugby' will give you '199,000,000' and 'soccer football' will give you '441,000,000'). It looks good enough.
Why am I so obsessed with Google rank? These guys have zillions of data points from sports web sites and articles, and their data is relevant to your domain problem. In the case of the guys from the paper (Western Air Ltd.), their data is specific to their internal domain, and they have to work out similarity using that domain (number of features, importance weighting of each feature, etc.).
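As a rough sketch of how those counts could feed a relevance percentage, the Python below scales the pair counts to the range 0 to 1, turns distance into a similarity, and combines the two with user weights. Everything beyond the two counts quoted above is an assumption for illustration: the function names, the linear distance falloff, scaling by the largest count, and the weighted average are not taken from the paper.

```python
# Result counts quoted above, keyed by the unordered pair of category names.
PAIR_COUNTS = {
    frozenset(("soccer", "football")): 441_000_000,
    frozenset(("soccer", "rugby")): 199_000_000,
}


def category_similarity(query, candidate, pair_counts):
    """Scale pair counts to [0, 1]; an exact match scores 1.0 (assumption)."""
    if candidate == query:
        return 1.0
    top = max(pair_counts.values())
    return pair_counts.get(frozenset((query, candidate)), 0) / top


def distance_similarity(distance_km, relaxed_max_km):
    """1.0 at 0 km, falling linearly to 0.0 at the relaxed limit (assumption)."""
    return max(0.0, 1.0 - distance_km / relaxed_max_km)


def relevance_percent(cat_sim, dist_sim, w_category, w_distance):
    """Weighted average of the two similarities, expressed in percent."""
    total = w_category + w_distance
    return 100.0 * (w_category * cat_sim + w_distance * dist_sim) / total


# Hypothetical record: a rugby event 15 km away, for a user who searched for
# soccer within 30 km (relaxed to 60 km) and weights category twice as much.
cat_sim = category_similarity("soccer", "rugby", PAIR_COUNTS)
dist_sim = distance_similarity(15, 60)
print(round(relevance_percent(cat_sim, dist_sim, w_category=2, w_distance=1), 1))
```

Note that with this scaling the sibling with the highest count (football here) scores the same as an exact match, so you may prefer to cap non-exact matches below 1.0.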
Upvotes: 2