Reputation: 19444
I have a MySQL database I'm searching through. Lets say this is a database of people. When querying for a specific record, it is possible to find a match 100% on each attribute. But querying the database to find closest match on probability (closest matches on table attributes) is more of the strategy.
In this scenario, does it make sense to create a temporary table (much like a tally-sheet) to indicate what attributes match/what attributes are present? What is the typical approach to doing advanced searches on database like this?
Example (below) of a hypothetical stored Procedure
*parameters are just to exemplify how I would search. I'm not concerned how to perform my selects. Question is about approach, strategy, technique *
call FindPerson ("Brown Eyes", "Brown hair", "Height:6'1", "white", "Name:Joe" ,"weight180", "Age 34" "sex m");
RESULT TABLE
NAME AGE HEIGHT WEIGHT HAIR SKIN sex RANK_MATCH
Joe 32 6'1 180 Brown white m 1
Mike 33 6'1 179 Brown white m 2
James 31 6'0 179 Brown black m 3
Upvotes: 4
Views: 602
Reputation: 3417
The approach I would use would be to create a scoring function (your stored proc) that would evaluate the given input's standard distance from the mean.
In the proc, you would judge each criteria in a fashion similar to:
INPUT AGE: 32
calculate MEAN of AGE WHERE (sex = m): 34.5
calculate STANDARD DEVIATION of AGE WHERE (sex = m): 2.5
calculate how many STDEVs 32 is from the 34.5 (also known as z-score): 1
Repeat this process for all numeric datatypes, summing them and ORDER BY the sum.
In doing so, the following schema change would be required: height changed from foot/inch form to strictly inches.
Depending on your needs, you may also consider coming up with an arbitrary scale for sex and skin color/hair color. Of course, you may think that measures like these should NOT be factored in because of how drastically it would change the scoring function. If you chose to, you'd have to find some number that would be added to the above SUM...but it's hard because nominative variables don't translate easily into these kinds of things.
If you find that haircolor/skin color is able to be usefully transferred into say, the continous color spectrum, your scoring tidbit would be the same...color value of input vs color value of means and standard deviations.
The query that would find your matches would be something to the effect of:
SELECT
ABS(INPUT_AGE - AVG(AGE)) / STD(AGE) AS age_z,
ABS(INPUT_WT - AVG(WT)) / STD(WT) AS wt_z,
...
(age_z + wt_z + ...) AS score
FROM `table`
ORDER BY score ASC
Upvotes: 2
Reputation: 2243
Just out of my mind. You can create your own score and sort by it. Something like
SELECT `id`,
(IF(`age`=32,1,0)+IF(`height`="6'1",1,0)+...) as `score`
FROM `people`
HAVING `score` > 0
ORDER BY `score` DESC
LIMIT 10;
With this, you can handle every field with its own comparison, and also weight the individual attributes by not just add 1
but 2
or more.
But I'm quiet not sure, how performant this is.
Upvotes: 2