Matthias Bohlen
Matthias Bohlen

Reputation: 279

SQL: Find rows that match closely but not exactly

I have a table inside a PostgreSQL database with columns c1,c2...cn. I want to run a query that compares each row against a tuple of values v1,v2...vn. The query should not return an exact match but should return a list of rows ordered in descending similarity to the value vector v.

Example:

The table contains sports records:

1,USA,basketball,1956
2,Sweden,basketball,1998
3,Sweden,skating,1998
4,Switzerland,golf,2001

Now when I run a query against this table with v=(Sweden,basketball,1998), I want to get all records that have a similarity with this vector, sorted by number of matching columns in descending order:

2,Sweden,basketball,1998   --> 3 columns match
3,Sweden,skating,1998  --> 2 columns match
1,USA,basketball,1956  --> 1 column matches

Row 4 is not returned because it does not match at all.

Edit: All columns are equally important. Although, when I really think of it... it would be a nice add-on if I could give each column a different weight factor as well.

Is there any possible SQL query that would return the rows in a reasonable amount of time, even when I run it against a million rows?

What would such a query look like?

Upvotes: 5

Views: 240

Answers (3)

objectNotFound
objectNotFound

Reputation: 1783

This is how I would do it ... the multiplication factors used in the case stmts will handle the importance(weight) of the match and they will ensure that those records that have matches for columns designated with the highest weight will come up top even if the other columns don't match for those particular records.

/*
-- Initial Setup 

-- drop table sport 
create table sport (id int, Country varchar(20) , sport varchar(20) , yr int )

insert into sport values 
(1,'USA','basketball','1956'),
(2,'Sweden','basketball','1998'),
(3,'Sweden','skating','1998'),
(4,'Switzerland','golf','2001')

select * from sport
*/


select * , 
        CASE WHEN Country='sweden'      then 1 else 0 end * 100 + 
        CASE WHEN sport='basketball'    then 1 else 0 end * 10  + 
        CASE WHEN yr=1998               then 1 else 0 end * 1       as Match
from sport
WHERE 
   country = 'sweden'
OR sport   = 'basketball'
OR yr      = 1998
ORDER BY Match Desc

Upvotes: 1

Bruce David Wilner
Bruce David Wilner

Reputation: 467

It might help if you wrote a stored procedure that calculates a "similarity metric" between two rows. Then your query could refer to the return value of that procedure directly rather than having umpteen conditions in the where-expression and the order-by-expression.

Upvotes: 0

Philipp
Philipp

Reputation: 2796

SELECT * FROM countries

WHERE country = 'sweden'
OR sport = 'basketball'
OR year = 1998

ORDER BY
cast(country = 'sweden' AS integer) +
cast(sport = 'basketball' as integer) + 
cast(year  = 1998 as integer) DESC

It's not beautiful, but well. You can cast the boolean expressions as integers and sum them.

You can easily change the weight, by adding a multiplicator.

cast(sport = 'basketball' as integer) * 5 + 

Upvotes: 2

Related Questions