Reputation: 2314
How would I go about looking through a varchar column in a Postgres table for rows that contain the same 3 word phrase?
Most of the full text search advice in other questions is comparing vectors to specific queries, but what I'm looking for is rows that contain any 3 word phrase as other rows.
Example:
SELECT *
FROM types t1
WHERE EXISTS (SELECT *
FROM types t2
WHERE t1.name phrase_matches t2.name
AND t1.id > t2.id)
Here, phrase_matches
is a made up operation where
'my foo bar baz' phrase_matches 'foo bar baz whatever'
returns true
and
'my foo bar baz' phrase_matches 'foo baz whatever bar'
returns false
Edit: Update for anyone coming from Google - the solution without the temp table, using a join, took more than an hour on a table with 18k rows. The temp table version ran in a few seconds total.
Upvotes: 0
Views: 257
Reputation: 23766
WITH words AS (
SELECT phrase, unnest, row_number() OVER ()
FROM (
SELECT phrase, unnest(string_to_array(phrase, ' '))
FROM phrases
)s
), phrase_parts AS (
SELECT
phrase, array_to_string(array_agg, ' ') as check_phrase
FROM (
SELECT
w1.phrase, array_agg(w2.unnest) OVER (PARTITION BY w1.row_number ORDER BY w2.row_number)
FROM words w1
JOIN words w2
ON w1.phrase = w2.phrase and w1.row_number <= w2.row_number
ORDER BY w1.row_number, w2.row_number
) s
WHERE array_length(array_agg, 1) = 3
)
SELECT p.phrase as a, pp.phrase as b, pp.check_phrase
FROM
phrases p
JOIN
phrase_parts pp
ON p.phrase LIKE '%' || pp.check_phrase || '%' and p.phrase <> pp.phrase
Expanded data set:
phrase
my foo bar baz
foo baz whatever bar
foo bar baz whatever
blah my foo bar blah
blah my foo baz blah
Result:
a b check_phrase
blah my foo bar blah my foo bar baz my foo bar
foo bar baz whatever my foo bar baz foo bar baz
my foo bar baz foo bar baz whatever foo bar baz
blah my foo baz blah blah my foo bar blah blah my foo
my foo bar baz blah my foo bar blah my foo bar
blah my foo bar blah blah my foo baz blah blah my foo
CTE words
creates a list of all words of all phrases. All words are getting an index to ensure the original order within their phrases.
CTE phrase_parts
creates all possible 3 word phrases: For every original phrase all words are joined.
After joining the result looks like:
phrase unnest row_number phrase unnest row_number
my foo bar baz my 1 my foo bar baz my 1
my foo bar baz my 1 my foo bar baz foo 2
my foo bar baz my 1 my foo bar baz bar 3
my foo bar baz my 1 my foo bar baz baz 4
my foo bar baz foo 2 my foo bar baz foo 2
my foo bar baz foo 2 my foo bar baz bar 3
my foo bar baz foo 2 my foo bar baz baz 4
my foo bar baz bar 3 my foo bar baz bar 3
my foo bar baz bar 3 my foo bar baz baz 4
my foo bar baz baz 4 my foo bar baz baz 4
foo baz whatever bar foo 5 foo baz whatever bar foo 5
foo baz whatever bar foo 5 foo baz whatever bar baz 6
foo baz whatever bar foo 5 foo baz whatever bar whatever 7
foo baz whatever bar foo 5 foo baz whatever bar bar 8
foo baz whatever bar baz 6 foo baz whatever bar baz 6
...
With the window function array_agg()
I can aggregate the second unnest
column this way:
array_agg
{my}
{my,foo}
{my,foo,bar}
{my,foo,bar,baz}
{foo}
{foo,bar}
{foo,bar,baz}
{bar}
{bar,baz}
{baz}
{foo}
{foo,baz}
{foo,baz,whatever}
{foo,baz,whatever,bar}
...
This is filtered for array length = 3
and reconverted into a string. The result are the 3 word phrases:
Upvotes: 1
Reputation: 198496
Make a table of trigrams-to-row-ids, then self-join on trigram column. Wastes lots of space, but hands down the easiest way to do it. With some help from klin's answer to How to extract n-gram word sequences from text in Postgres:
-- your table
CREATE TABLE phrases (
id INT,
phrase TEXT
);
-- your data
INSERT INTO phrases (id, phrase) VALUES
(1, 'my foo bar baz'),
(2, 'foo bar baz whatever'),
(3, 'foo baz whatever bar');
-- function to extract word n-grams
-- from https://stackoverflow.com/a/51571001/240443
CREATE OR REPLACE FUNCTION word_ngrams(str TEXT, n INT)
RETURNS SETOF TEXT LANGUAGE plpgsql AS $$
DECLARE
i INT;
arr TEXT[];
BEGIN
str := regexp_replace(str, '[^[:alnum:]|\s]', '', 'g');
arr := string_to_array(str, ' ');
FOR i in 1 .. cardinality(arr) - n + 1 LOOP
RETURN NEXT array_to_string(arr[i : i+n-1], ' ');
END LOOP;
END $$;
-- table of all trigrams (my foo bar, foo bar baz, bar baz whatever...)
-- and rows they belong to
CREATE TEMPORARY TABLE trigrams (
id INT,
trigram TEXT
);
-- make sure JOIN doesn't take forever
CREATE INDEX ON trigrams (trigram, id);
-- extract the trigrams into their stylish new - yet temporary - home
INSERT INTO trigrams SELECT id, word_ngrams(phrase, 3) FROM phrases;
-- see which original rows have common trigrams
SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM trigrams T1 JOIN trigrams T2
ON T1.trigram = T2.trigram
AND T1 < T2;
-- | id1 | id2
---+-----+----
-- | 1 | 2
You can also use the word_ngrams
function directly, without a temp table, but it will be much slower. Time or space, pick only one :P This replaces everything in the previous code snippet from CREATE TEMPORARY TABLE
onwards (but still uses klin's wonderful function).
SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM phrases T1 JOIN phrases T2
ON EXISTS (
SELECT word_ngrams(T1.phrase, 3)
INTERSECT
SELECT word_ngrams(T2.phrase, 3)
)
AND T1.id < T2.id;
-- | id1 | id2
---+-----+----
-- | 1 | 2
Upvotes: 2
Reputation: 483
There might be better options but you can also do something like this. Not exactly the thing you are asking for, but i am sure you will be able to take it forward with this idea.
select n.name from(
select x.name as xname,count(*) from
(
(
select name,unnest(string_to_array(name2,' ')) as name2
from new
)as x
inner join
(
select name,unnest(string_to_array(name,' ')) as name1
from new
)as y
on x.name2=y.name1 and y.id>x.id
) group by x.name having count(*)>=3)r inner join new n on r.xname=n.name
Here's a fiddle for the same:https://www.db-fiddle.com/f/phLirNij577PwEpd8UERef/0
Note i haven't included id in my fiddle but you can do so yourself.
Upvotes: 0