Reputation: 32933
I'm using MYSQL's FULL TEXT search functionality (in Mysql 5.6.33).
If I do a MATCH in NATURAL LANGUAGE mode, for a postcode, with a one-character typo, i get some decent results back, including the results with the "right" postcode, but they're not near the top.
For example, there are 10 schools with postcode "BN2 1TL"
. I deliberately misspell this as "BN2 1TM"
and do a search as follows:
SELECT record_id, address_string,
MATCH (address_string) AGAINST ("BN2 1TM" IN NATURAL LANGUAGE MODE) AS score
FROM schools
WHERE MATCH (address_string) AGAINST ("BN2 1TM" IN NATURAL LANGUAGE MODE) > 0
ORDER BY score DESC;
On closer inspection, it's because the search has bought back all results that have either "BN2"
or "1TM"
in their address_string
column, and they all have exactly the same score, and so are in random order, effectively. .
This is perfectly reasonable behaviour, but it would be great if I could get the score to take "closeness" into account, meaning that, for a search on "BN2 1TM"
, "BN2 1TL"
would be scored more highly than "BN2 3PQ"
. Is there a way to do this?
EDIT: I remembered that this type of closeness is technically called "Levenshtein distance", which is a reference to the Levenshtein algorithm for determining how many replacements are needed to turn one string into another. So I guess my question could be like "Can i get MYSQL FULLTEXT NATURAL LANGUAGE MODE scoring to take Levenshtein distance into account"?
Upvotes: 4
Views: 1292
Reputation: 108676
First off, MySQL fulltext is not nearly as good at open-ended searching as dedicated systems like Lucene.
There's an algorithm, called the Levenshtein distance, that computes the number of transformations of characters -- the distance -- to change one string into another.
So, changing "BN2 1TM" into "BN2 1MT" (a transposition) has a distance of 2. Changing it into "BN2 1TX" has a distance of 1.
Levenshtein distance isn't terribly useful for phrases unless they're almost exactly the same. Changing "Apache Sphinx" into "MySQL FULLTEXT" gives a distance of 14, the length of the longer string. But it's useful for postcodes, partnumbers, and other short structured words.
You could try something like this to get the nearest values first.
SELECT city, county, postcode
FROM table
ORDER BY levenshtein(postcode, 'BN2 1MT') ASC
Then, all you need is a stored function to compute Levenshtein distances. (This isn't built into FULLTEXT.)
From this source, here is such a stored function. But beware, it's not fast and it cannot use an index. So if you can narrow down the search before you do this, you'll get better performance.
DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1),
s2_len = CHAR_LENGTH(s2),
cv1 = 0x00,
j = 1,
i = 1,
c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;
Upvotes: 4