Reputation: 1308
I need to implement a skill matching feature similar to http://venturocket.com - a candidate enters a list of skills and rates his proficiency for each. You can then search by again entering some skills and the level of expertise you are looking for. The result is a list of candidates ordered by how well their skills match your search.
Example:
Candidate 1 enters skill Java (proficiency 90) and candidate 2 enters Java (50). When I search for Java (60) candidate 2 is a closer match.
This shold also work with multiple skills.
What I'm looking for are pointers to technologies or algorithms that would help me achieve this. My current approach would be to do a range query in a database (e.g. look for Java skills between 45 and 75) and then sort on the client, but that wouldn't be very fast.
Upvotes: 3
Views: 3831
Reputation: 11061
You could treat this as an information retrieval problem and use cosine similarity.
This involves forming for each candidate a vector of what scores they entered for each tag. Unmentioned tags get a score of 0. Queries are transformed similarly, letting the user request a score for each tag, or perhaps just treating mentioned tags as high scores, etc. Using dot products and magnitudes, one can compute a similarity score between the query and each candidate; sort and choose the top highest.
Those are the broad strokes for implementing it yourself. In any serious application I'd suggest you not do that, and instead dust off something like sphinx or lucene to do it for you.
Upvotes: 0
Reputation: 21
If I was asked to implement something like this, I would start by looking at clustering algorithms.
By grouping candidates together based on how they are similar on a number of properties (skills), it would be easy to figure out what cluster of candidates most likely match your search parameters.
k-means clustering is fairly easy to use and would probably be a good place to start. http://en.wikipedia.org/wiki/K-means_clustering
There are solid implementations of k-means in most programming languages, so getting started should be fairly easy.
There's a lot of good information about cluster based filtering in Programming Collective Intelligence — http://shop.oreilly.com/product/9780596529321.do
Upvotes: 2
Reputation: 32575
Pass the value that you are checking against in as a parameter for the query and then use the Euclidean Distance (the square of the difference) to sort:
SELECT TOP 20 * -- added a TOP 20 as example, choose/limit as appropriate for your situation
FROM Candidate
ORDER BY SQUARE(Candidate.JavaProficiency - @JavaProficiency) + SQUARE(Candidate.SqlProficiency - @SqlProficiency)
For multiple traits you sum up each of the square differences.
See Wikipedia: Euclidean Distance for a bit more detail (specifically the "Squared Euclidean Distance" section). Note that this answer is actually DanRedux's (see comments/edits).
Upvotes: 4