Reise45
Reise45

Reputation: 1203

SQL: Find longest common string between rows

I have a table T1:

    Col
   -------
  1 THE APPLE
 THE APPLE
 THE APPLE 123
 THE APPLE 12/16
 BEST THE APPLE

I want T2:

 Result
--------
 THE APPLE

I am using Redshift, Looking for some way to do a fuzzystring match in SQL. Longest length possible for column is 100 characters. At no point will I have to compare more than 25 rows.

Upvotes: 2

Views: 3527

Answers (2)

Patrick
Patrick

Reputation: 32326

This question requires a fair degree of complication to solve and it's running time will drastically increase with increasing string lengths and number of records. However, given that your table T1 is rather small, you might just manage with the below PL/pgSQL function.

Algorithm

  1. Find the shortest value in T1(col). This is the longest possible match among all records. This is the candidate string.
  2. See if the candidate is present in all other rows of T1. If so, return the present candidate to the result set.
  3. Move the candidate over one position in the shortest value and back to step 2 until the candidate reaches the end of the shortest string.
  4. If matching candidates have been found, return from the function. Otherwise, shorten the candidate by 1 and start over from the beginning of the shortest string and move to step 2. If no more candidates can be extracted from the shortest string, then return NULL.

Code

The important thing in the code below is the short-circuit in the check for a match: as soon as a single record does not match col to the candidate string there is no need to check further. So for long strings, the comparison is really from the shortest string with one other string, increasing the rows examined only when candidate strings get so short that they are indeed more ubiquitous.

The string comparison is case-sensitive; If you want to make it case-insensitive, change LIKE to ILIKE. As a bonus feature, you will get multiple matching strings (obviously all of the same length) that are all present in all rows. On the down side, it will report multiple identical strings once it gets down to single char matches (and possible some 2-char and longer strings). You can use a SELECT DISTINCT * to get rid of those duplicates.

CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$
DECLARE
  shortest  varchar;       -- The shortest string in T1(col) so the longest possible match
  candidate varchar;       -- Candidate string to test
  sz_sh     integer;       -- Length of "shortest"
  l         integer := 1;  -- Starting position of "candidate" in "shortest"
  sz        integer;       -- Length of "candidate"
  fail      boolean;       -- Has "candidate" been found in T1(col)?
  found_one boolean := false; -- Flag if we found at least one match
BEGIN
  -- Find the shortest string and its size, don't worry about multiples, need just 1
  SELECT col, char_length(col) INTO shortest, sz_sh
  FROM T1
  ORDER BY char_length(col) ASC NULLS LAST
  LIMIT 1;

  -- Get all the candidates from the shortest string and test them from longest to single char
  candidate := shortest;
  sz := sz_sh;
  LOOP
    -- Check rows in T1 if they contain the candidate string.
    -- Short-circuit as soon as a record does not match the candidate
    <<check_T1>>
    BEGIN
      FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP
        EXIT check_T1 WHEN fail;
      END LOOP;
      -- Block was not exited, so the candidate is present in all rows: we have a match
      RETURN NEXT candidate;
      found_one := true;
    END;

    -- Produce the next candidate
    IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest"
      -- Exit if we already have at least one matching candidate
      EXIT WHEN found_one;
      -- .. otherwise shorthen the candidate
      sz := sz - 1;
      l := 1;
    ELSE
      -- Exit with empty result when all candidates have been examined
      EXIT WHEN l = sz_sh;
      -- .. otherwise move one position over to get the next candidate
      l := l + 1;
    END IF;
    candidate := substring(shortest from l for sz);
  END LOOP;

  RETURN;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

Calling SELECT * FROM find_longest_string_in_T1() should do the trick.

Simple testing

Create some test data:

INSERT INTO T1 
  SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);

This generates 100 rows of 106 characters long and yields the matches "hello" and "match" (and very unlikely any other matches). This produces the correct two strings in less than half a second (no frills Ubuntu server, PG 9.3, CPU i5, 4GB of RAM).

Upvotes: 2

Brian DeMilia
Brian DeMilia

Reputation: 13248

If you're okay with getting the most commonly appearing word among all rows (the most common word that is separated by a space), you could use:

select word, count(distinct rn) as num_rows
from(
select unnest(string_to_array(col, ' ')) as word,
       row_number() over(order by col) as rn
from tbl
) x
group by word
order by num_rows desc

Fiddle: http://sqlfiddle.com/#!15/bc803/9/0

Note that this finds the word apple among 4 rows, not 5. This is because APPLE123 is one word, whereas APPLE 123 would be two words, one of which is APPLE, and would count, but it doesn't.

Upvotes: 2

Related Questions