Anthony Palmer
Anthony Palmer

Reputation: 972

MySQL select maximum length of matching string

I need to return all the text results(s), if any, that share the maximum length left bounded substring common to the search string.

Given a search for "StackOverflow" in a table column containing

"Stack",
"Sta", 
"StackOv", 
"StackOverthrow",
"StackOverSlow",
"StackFlow", 
"Soverflow",
"StackOverCrow",
"StackOverSlow",
etc. 

the query would return "StackOverthrow" as it contains the greatest number of matching characters, as well as StackOverSlow and StackOverCrow in a unique result set. Currently Im doing something inefficient which is to start with a LIKE search for the first characters and to continue repeating and extending the search string until nothing is found, and keeping the last good result.

i.e.

select names from table where name like 'XX%';


 "S" ->Results
 "St"->Results
 . .
 "StackOver"->Results 
 "StackOverf"-> No results (Last result returning items beginning with StackOver etc  as being the correct answer)

I know that this approach is extremely inefficient, can anyone provide a single query to achieve this result? I know I could search for all combinations at once and filter for the longest results in code, however, I think the DB should be better at this.

Edit1: Note the example above is somewhat of a simplification. The vast majority of data in the DB is between 2 and 10 chars, with the most common match length of about 3 chars. There are upwards of 100K records in the table.

Edit2: Apologies, I needed to clarify that there may be more than one correct result, and that the results can contain duplicates that need to be removed. Currently with my inefficient method selecting distinct is easy.

Upvotes: 2

Views: 2215

Answers (3)

eggyal
eggyal

Reputation: 125865

With an index on name, the following ought to be extremely performant:

SELECT DISTINCT name
FROM   myTable
WHERE  name LIKE CASE
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'S%') THEN '%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'St%') THEN 'S%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Sta%') THEN 'St%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stac%') THEN 'Sta%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stack%') THEN 'Stac%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackO%') THEN 'Stack%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOv%') THEN 'StackO%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOve%') THEN 'StackOv%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOver%') THEN 'StackOve%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverf%') THEN 'StackOver%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverfl%') THEN 'StackOverf%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflo%') THEN 'StackOverfl%'
  WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflow%') THEN 'StackOverflo%'
  ELSE 'StackOverflow%'
END

See it on sqlfiddle.

Upvotes: 3

Buggabill
Buggabill

Reputation: 13901

You could do a query after creating a Levenshtein Distance stored function. This could get the best matched results for you.

This is not my code. I got this from here. It does seem to test well on sqlfiddle.

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN 
            SET cost = 0; ELSE SET cost = 1;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
            IF c > c_temp THEN 
              SET c = c_temp; 
            END IF;
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
  END;

Your query could then look something like this:

SELECT names, levenshtein(`names`, 'StackOverflow') as dist
FROM mytable
ORDER BY dist;

Here is what this looks like over on sqlfiddle.

The results would look like this with the lowest distance being the closest match:

NAMES           DIST
StackOverthrow  3
StackFlow       4
Soverflow       4
StackOv         6
Stack           8
Sta             10

Upvotes: 1

DRapp
DRapp

Reputation: 48139

Don't know why you would look at the smallest first. I would do that in reverse... try by the LONGEST EXACT match first, and if not found, work backwards 1 character at a time until one is found.

Upvotes: 0

Related Questions