Rossa Urs
Rossa Urs

Reputation: 17

MySQL longest common substring

Does anybody have a MySQL function for longest common substring (LCS)? I found a function here, but in SQL. As a self-taught programmer I don’t know much MySQL, but working on art-with-language project.

Upvotes: 0

Views: 1687

Answers (1)

Michael - sqlbot
Michael - sqlbot

Reputation: 179124

MySQL is arguably not the most appropriate place to implement functions for string manipulation, and typically we need questions to show some effort on the desired code. I'm being a little flexible on this one, since you did at least find a reference to what you're trying to do and ask whether MySQL has a native capability.

It doesn't.

You also asked if the example code could be rewritten for MySQL.

It can't. It appears to rely on capabilities not implemented in MySQL Server.

However... the question intrigued me, and I like doing unusual things in MySQL (sometimes, it's just nice to be able to do something in the database, even though it's not necessarily the most efficient place, and sometimes it's quite arguably the wrong place, but still handy)... so instead of a shower, I had a bath this morning, and during that time, I came up with this:

DELIMITER $$

DROP FUNCTION IF EXISTS `longest_common_substring` $$
CREATE FUNCTION `longest_common_substring`(short_str TEXT, long_str TEXT) RETURNS text CHARSET utf8
    NO SQL
    DETERMINISTIC
BEGIN
-- http://stackoverflow.com/questions/35545281/mysql-longest-common-substring
DECLARE short_len INT DEFAULT CHAR_LENGTH(short_str);
DECLARE long_len INT DEFAULT CHAR_LENGTH(long_str);
DECLARE swap_str TEXT;

DECLARE max_matched_len INT DEFAULT 0;
DECLARE max_at_left_marker INT DEFAULT NULL;
DECLARE max_at_match_len INT DEFAULT NULL;
DECLARE left_marker INT DEFAULT 0;
DECLARE match_len INT DEFAULT NULL;

IF short_str IS NULL OR long_str IS NULL THEN
  RETURN NULL;
ELSEIF short_str = long_str THEN
  RETURN short_str;
END IF;

IF short_len > long_len THEN
  SET swap_str = long_str;
  SET long_str = short_str;
  SET short_str = swap_str;
  SET short_len = long_len;
  SET long_len = CHAR_LENGTH(long_str);
END IF;

left_loop:
LOOP
  SET left_marker = left_marker + 1;
  IF left_marker + max_matched_len > short_len THEN
    LEAVE left_loop;
  END IF;
  SET match_len = max_matched_len;
  right_loop:
  LOOP
    SET match_len = match_len + 1;
    IF 1 - left_marker + match_len > short_len THEN
      LEAVE right_loop;
    END IF;
    IF long_str LIKE CONCAT ('%',SUBSTRING(short_str FROM left_marker FOR match_len),'%') THEN
      SET max_matched_len = match_len, max_at_left_marker = left_marker;
    ELSE
      LEAVE right_loop;
    END IF;
  END LOOP;
END LOOP;

IF (max_matched_len) THEN
  RETURN SUBSTRING(short_str FROM max_at_left_marker FOR max_matched_len);
ELSE
  RETURN NULL;
END IF;

END $$

DELIMITER ;

It appears to work correctly.

mysql> SELECT longest_common_substring('Lions are growing like yellow roses on the wind','and we turn gracefully in the medieval garden of their roaring blossoms');
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| longest_common_substring('Lions are growing like yellow roses on the wind','and we turn gracefully in the medieval garden of their roaring blossoms') |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| n the                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT longest_common_substring('die, bart, die','sideshow bob dislikes bart simpson');
+---------------------------------------------------------------------------------+
| longest_common_substring('die, bart, die','sideshow bob dislikes bart simpson') |
+---------------------------------------------------------------------------------+
|  bart                                                                           |
+---------------------------------------------------------------------------------+
1 row in set (0.01 sec)

There are some caveats --

If there is more than one "longest" substring match, that is, if there are two (or more) "longest" substring matches of exactly the same length, the first match will be the one that is returned.

This code does not consider spaces or punctuation as being any less significant than other characters, so in the second example, above, the answer is actually 5 characters, ' bart' with a leading space. Arguably, a substring with 5 non-space characters would be a better match, if it existed, and this code would not find it, since the first match is used and no subsequent matches are considered unless they are longer. It could be modified to be more sophisticated, but this is essentially a proof of concept.

mysql> SELECT longest_common_substring('bart and lisa','bart or lisa');
+----------------------------------------------------------+
| longest_common_substring('bart and lisa','bart or lisa') |
+----------------------------------------------------------+
| bart                                                     |
+----------------------------------------------------------+
1 row in set (0.00 sec)

...however, if a shorter match is a candidate but an unconnected but longer one follows, the results are definitely as expected.

mysql> SELECT longest_common_substring('bart and maggie','bart or maggie');
+--------------------------------------------------------------+
| longest_common_substring('bart and maggie','bart or maggie') |
+--------------------------------------------------------------+
|  maggie                                                      |
+--------------------------------------------------------------+
1 row in set (0.00 sec)

How it works:

We take two arguments, expecting the shorter string first. If the longer string is first, that's fine, because we swap them in memory, but it costs us a little bit of processing time.

Then we drag a temporary substring -- a specifically-crafted fragment of the short string -- across the long string, checking whether the long string is LIKE % + our temporary substring + %. If not, we move to the next character. If so, we broaden the temporary substring by 1 character until we no longer have a match -- but while we were in possession of a match, we saved its position and length, and use this information as a subsequent optimization to avoid unnecessary comparisons that can't possibly yield a better match.

I may add this to https://github.com/sqlbot/dtsl-mysql, my collection of date, time, and string manipulation functions, once I'm ready to release it.

Upvotes: 4

Related Questions