Reputation: 1203
I have a table T1:
Col
-------
1 THE APPLE
THE APPLE
THE APPLE 123
THE APPLE 12/16
BEST THE APPLE
I want T2:
Result
--------
THE APPLE
I am using Redshift, Looking for some way to do a fuzzystring match in SQL. Longest length possible for column is 100 characters. At no point will I have to compare more than 25 rows.
Upvotes: 2
Views: 3527
Reputation: 32326
This question requires a fair degree of complication to solve and it's running time will drastically increase with increasing string lengths and number of records. However, given that your table T1 is rather small, you might just manage with the below PL/pgSQL function.
NULL
.The important thing in the code below is the short-circuit in the check for a match: as soon as a single record does not match col
to the candidate string there is no need to check further. So for long strings, the comparison is really from the shortest string with one other string, increasing the rows examined only when candidate strings get so short that they are indeed more ubiquitous.
The string comparison is case-sensitive; If you want to make it case-insensitive, change LIKE
to ILIKE
. As a bonus feature, you will get multiple matching strings (obviously all of the same length) that are all present in all rows. On the down side, it will report multiple identical strings once it gets down to single char matches (and possible some 2-char and longer strings). You can use a SELECT DISTINCT *
to get rid of those duplicates.
CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$
DECLARE
shortest varchar; -- The shortest string in T1(col) so the longest possible match
candidate varchar; -- Candidate string to test
sz_sh integer; -- Length of "shortest"
l integer := 1; -- Starting position of "candidate" in "shortest"
sz integer; -- Length of "candidate"
fail boolean; -- Has "candidate" been found in T1(col)?
found_one boolean := false; -- Flag if we found at least one match
BEGIN
-- Find the shortest string and its size, don't worry about multiples, need just 1
SELECT col, char_length(col) INTO shortest, sz_sh
FROM T1
ORDER BY char_length(col) ASC NULLS LAST
LIMIT 1;
-- Get all the candidates from the shortest string and test them from longest to single char
candidate := shortest;
sz := sz_sh;
LOOP
-- Check rows in T1 if they contain the candidate string.
-- Short-circuit as soon as a record does not match the candidate
<<check_T1>>
BEGIN
FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP
EXIT check_T1 WHEN fail;
END LOOP;
-- Block was not exited, so the candidate is present in all rows: we have a match
RETURN NEXT candidate;
found_one := true;
END;
-- Produce the next candidate
IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest"
-- Exit if we already have at least one matching candidate
EXIT WHEN found_one;
-- .. otherwise shorthen the candidate
sz := sz - 1;
l := 1;
ELSE
-- Exit with empty result when all candidates have been examined
EXIT WHEN l = sz_sh;
-- .. otherwise move one position over to get the next candidate
l := l + 1;
END IF;
candidate := substring(shortest from l for sz);
END LOOP;
RETURN;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Calling SELECT * FROM find_longest_string_in_T1()
should do the trick.
Create some test data:
INSERT INTO T1
SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1
SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1
SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1
SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
This generates 100 rows of 106 characters long and yields the matches "hello" and "match" (and very unlikely any other matches). This produces the correct two strings in less than half a second (no frills Ubuntu server, PG 9.3, CPU i5, 4GB of RAM).
Upvotes: 2
Reputation: 13248
If you're okay with getting the most commonly appearing word among all rows (the most common word that is separated by a space), you could use:
select word, count(distinct rn) as num_rows
from(
select unnest(string_to_array(col, ' ')) as word,
row_number() over(order by col) as rn
from tbl
) x
group by word
order by num_rows desc
Fiddle: http://sqlfiddle.com/#!15/bc803/9/0
Note that this finds the word apple
among 4 rows, not 5. This is because APPLE123
is one word, whereas APPLE 123
would be two words, one of which is APPLE, and would count, but it doesn't.
Upvotes: 2