Haradzieniec
Haradzieniec

Reputation: 9340

select rows with longest substring of the string

Let me describe the problem based on the example below. Lets say there is a string "abc12345" (could be any!!!) and there is a table mytable with a column mycolumn of varchar(100).

There are some rows that ends with the last character 5.
There are some rows that ends with the last characters 45.
There are some rows that ends with the last characters 345
There are no rows that ends with the last characters 2345.

In this case these rows should be selected:

SELECT * FROM mytable WHERE mycolumn LIKE "%345"

That's because "345" is the longest right substring of "abc12345" that occurs at least once as the right substring of at least one string in the mycolumn column. Any ideas how to write it in one query? Thank you.

Upvotes: 4

Views: 455

Answers (3)

Sasha Pachev
Sasha Pachev

Reputation: 5326

If you cannot restructure the table I would approach the problem this way:

  • Write an aggregate UDF LONGEST_SUFFIX_MATCH(col, str) in C (see an example in sql/udf_example.c in the MySQL source, search for avgcost)

  • SELECT @longest_match:=LONGEST_SUFFIX_MATCH(mycol, "abcd12345") FROM mytbl; SELECT * FROM mytbl WHERE mycol LIKE CONCAT('%', SUBSTR("abcd12345", -@longest_match))

If you could restructure the table, I do not have a complete solution yet, but the first thing I would add a special column mycol_rev obtained by reversing the string (via REVERSE() function) and create a key on it, then use that key for lookups. Will post a full solution when I have a moment.

Update:

If you can add a reversed column with a key on it:

  • use the query in the format of `SELECT myrevcol FROM mytbl WHERE myrevcol LIKE CONCAT(SUBSTR(REVERSE('$search_string'), $n),'%') LIMIT 1 performing a binary search with respect to $n over the range from 1 to the length of $search_string to find the largest value of $n for which the query returns a row
  • SELECT * FROM mytbl WHERE myrevcol LIKE CONCAT(SUBSTR(REVERSE('$search_string'), $found_n),'%')

This solution should be very fast as long as you do not have too many rows coming back. We will have a total of O(log(L)) queries where L is the length of the search string each of those being a B-tree search with the read of just one row followed by another B-tree search with the index read of only the needed rows.

Upvotes: 1

Marcin Zukowski
Marcin Zukowski

Reputation: 4729

Interesting puzzle :)

The hardest problem here is finding what is the length of the target suffix matching your suffix pattern.

In MySQL you probably need to use either generating series or a UDF. Others proposed these already.

In PostgreSQL and other systems that provide regexp-based substring, you can use the following trick:

select v,
    reverse(
      substring(
        reverse(v) || '#' || reverse('abcdefg')
        from '^(.*).*#\1.*'
    )) res
from table;

What it does is:

  • constructs a single string combining your string and suffix. Note, we reverse them.
  • we put # in between the strings that's important, you need a character that doesn't exist in your string.
  • we extract a match from a regular expression, using substring, such that
    • it starts at the beginning of the string ^
    • matches any number of characters (.*)
    • can have some remaining characters .*
    • now we find #
    • now, we want the same string we matched with (.*) to be present right after #. So we use \1
    • and there can be some tail characters .*
    • we reverse the extracted string

Once you have the longest suffix, finding maximum length, and then finding all strings having the suffix of that length is trivial.

Here's a SQLFiddle using PostgreSQL:

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269953

This is a brute force method:

select t.*
from (select t.*,
             dense_rank() over (order by (case when mycolumn like '%abc12345' then 1
                                               when mycolumn like '%bc12345' then 2
                                               when mycolumn like '%c12345' then 3
                                               when mycolumn like '%12345' then 4
                                               when mycolumn like '%2345' then 5
                                               when mycolumn like '%345' then 6
                                               when mycolumn like '%45' then 7
                                               when mycolumn like '%5' then 8
                                         end)
                              ) as seqnum
      where mycolumn like '%5' -- ensure at least one match  
      from t
     ) t
where seqnum = 1;

This then inspires something like this:

select t.*
from (select t.*, max(i) over () as maxi
      from t join
           (select str, generate_series(1, length(str)) as i
            from (select 'abc12345' as str) s
           ) s
           on left(t.mycolumn, i) = left(str, i)
     ) t
where i = maxi;

Upvotes: 2

Related Questions