Reputation: 972
I need to return all the text results(s), if any, that share the maximum length left bounded substring common to the search string.
Given a search for "StackOverflow" in a table column containing
"Stack",
"Sta",
"StackOv",
"StackOverthrow",
"StackOverSlow",
"StackFlow",
"Soverflow",
"StackOverCrow",
"StackOverSlow",
etc.
the query would return "StackOverthrow" as it contains the greatest number of matching characters, as well as StackOverSlow and StackOverCrow in a unique result set. Currently Im doing something inefficient which is to start with a LIKE search for the first characters and to continue repeating and extending the search string until nothing is found, and keeping the last good result.
i.e.
select names from table where name like 'XX%';
"S" ->Results
"St"->Results
. .
"StackOver"->Results
"StackOverf"-> No results (Last result returning items beginning with StackOver etc as being the correct answer)
I know that this approach is extremely inefficient, can anyone provide a single query to achieve this result? I know I could search for all combinations at once and filter for the longest results in code, however, I think the DB should be better at this.
Edit1: Note the example above is somewhat of a simplification. The vast majority of data in the DB is between 2 and 10 chars, with the most common match length of about 3 chars. There are upwards of 100K records in the table.
Edit2: Apologies, I needed to clarify that there may be more than one correct result, and that the results can contain duplicates that need to be removed. Currently with my inefficient method selecting distinct is easy.
Upvotes: 2
Views: 2215
Reputation: 125865
With an index on name
, the following ought to be extremely performant:
SELECT DISTINCT name
FROM myTable
WHERE name LIKE CASE
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'S%') THEN '%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'St%') THEN 'S%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Sta%') THEN 'St%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stac%') THEN 'Sta%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'Stack%') THEN 'Stac%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackO%') THEN 'Stack%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOv%') THEN 'StackO%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOve%') THEN 'StackOv%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOver%') THEN 'StackOve%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverf%') THEN 'StackOver%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverfl%') THEN 'StackOverf%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflo%') THEN 'StackOverfl%'
WHEN NOT EXISTS(SELECT * FROM myTable WHERE name LIKE 'StackOverflow%') THEN 'StackOverflo%'
ELSE 'StackOverflow%'
END
See it on sqlfiddle.
Upvotes: 3
Reputation: 13901
You could do a query after creating a Levenshtein Distance stored function. This could get the best matched results for you.
This is not my code. I got this from here. It does seem to test well on sqlfiddle.
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
Your query could then look something like this:
SELECT names, levenshtein(`names`, 'StackOverflow') as dist
FROM mytable
ORDER BY dist;
Here is what this looks like over on sqlfiddle.
The results would look like this with the lowest distance being the closest match:
NAMES DIST
StackOverthrow 3
StackFlow 4
Soverflow 4
StackOv 6
Stack 8
Sta 10
Upvotes: 1
Reputation: 48139
Don't know why you would look at the smallest first. I would do that in reverse... try by the LONGEST EXACT match first, and if not found, work backwards 1 character at a time until one is found.
Upvotes: 0