Max Williams
Max Williams

Reputation: 32933

Mysql Fulltext search, natural language mode: order by "closeness"

I'm using MYSQL's FULL TEXT search functionality (in Mysql 5.6.33).

If I do a MATCH in NATURAL LANGUAGE mode, for a postcode, with a one-character typo, i get some decent results back, including the results with the "right" postcode, but they're not near the top.

For example, there are 10 schools with postcode "BN2 1TL". I deliberately misspell this as "BN2 1TM" and do a search as follows:

SELECT record_id, address_string, 
  MATCH (address_string) AGAINST ("BN2 1TM" IN NATURAL LANGUAGE MODE) AS score 
  FROM schools 
  WHERE MATCH (address_string) AGAINST ("BN2 1TM" IN NATURAL LANGUAGE MODE) > 0 
  ORDER BY score DESC;

On closer inspection, it's because the search has bought back all results that have either "BN2" or "1TM" in their address_string column, and they all have exactly the same score, and so are in random order, effectively. .

This is perfectly reasonable behaviour, but it would be great if I could get the score to take "closeness" into account, meaning that, for a search on "BN2 1TM", "BN2 1TL" would be scored more highly than "BN2 3PQ". Is there a way to do this?

EDIT: I remembered that this type of closeness is technically called "Levenshtein distance", which is a reference to the Levenshtein algorithm for determining how many replacements are needed to turn one string into another. So I guess my question could be like "Can i get MYSQL FULLTEXT NATURAL LANGUAGE MODE scoring to take Levenshtein distance into account"?

Upvotes: 4

Views: 1292

Answers (1)

O. Jones
O. Jones

Reputation: 108676

First off, MySQL fulltext is not nearly as good at open-ended searching as dedicated systems like Lucene.

There's an algorithm, called the Levenshtein distance, that computes the number of transformations of characters -- the distance -- to change one string into another.

So, changing "BN2 1TM" into "BN2 1MT" (a transposition) has a distance of 2. Changing it into "BN2 1TX" has a distance of 1.

Levenshtein distance isn't terribly useful for phrases unless they're almost exactly the same. Changing "Apache Sphinx" into "MySQL FULLTEXT" gives a distance of 14, the length of the longer string. But it's useful for postcodes, partnumbers, and other short structured words.

You could try something like this to get the nearest values first.

  SELECT city, county, postcode
    FROM table
   ORDER BY levenshtein(postcode, 'BN2 1MT') ASC

Then, all you need is a stored function to compute Levenshtein distances. (This isn't built into FULLTEXT.)

From this source, here is such a stored function. But beware, it's not fast and it cannot use an index. So if you can narrow down the search before you do this, you'll get better performance.

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
    BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        -- max strlen=255
        DECLARE cv0, cv1 VARBINARY(256);

        SET s1_len = CHAR_LENGTH(s1), 
            s2_len = CHAR_LENGTH(s2), 
            cv1 = 0x00, 
            j = 1, 
            i = 1, 
            c = 0;

        IF s1 = s2 THEN
            RETURN 0;
        ELSEIF s1_len = 0 THEN
            RETURN s2_len;
        ELSEIF s2_len = 0 THEN
            RETURN s1_len;
        ELSE
            WHILE j <= s2_len DO
                SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
            END WHILE;
            WHILE i <= s1_len DO
                SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
                WHILE j <= s2_len DO
                    SET c = c + 1;
                    IF s1_char = SUBSTRING(s2, j, 1) THEN
                        SET cost = 0; ELSE SET cost = 1;
                    END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                    IF c > c_temp THEN SET c = c_temp; END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                    IF c > c_temp THEN
                        SET c = c_temp;
                    END IF;
                    SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
                END WHILE;
                SET cv1 = cv0, i = i + 1;
            END WHILE;
        END IF;
        RETURN c;
    END$$
DELIMITER ;

Upvotes: 4

Related Questions