paqash
paqash

Reputation: 2314

Phrase comparison in PostgreSQL

How would I go about looking through a varchar column in a Postgres table for rows that contain the same 3 word phrase?

Most of the full text search advice in other questions is comparing vectors to specific queries, but what I'm looking for is rows that contain any 3 word phrase as other rows.

Example:

SELECT * 
FROM types t1 
WHERE EXISTS (SELECT * 
              FROM types t2 
              WHERE t1.name phrase_matches t2.name 
                AND t1.id > t2.id)

Here, phrase_matches is a made up operation where

'my foo bar baz' phrase_matches 'foo bar baz whatever' returns true

and

'my foo bar baz' phrase_matches 'foo baz whatever bar' returns false

Edit: Update for anyone coming from Google - the solution without the temp table, using a join, took more than an hour on a table with 18k rows. The temp table version ran in a few seconds total.

Upvotes: 0

Views: 257

Answers (3)

S-Man
S-Man

Reputation: 23766

demo: db<>fiddle

WITH words AS (
    SELECT phrase, unnest, row_number() OVER ()
    FROM (
        SELECT phrase, unnest(string_to_array(phrase, ' '))
        FROM phrases
    )s
), phrase_parts AS (

    SELECT 
        phrase, array_to_string(array_agg, ' ') as check_phrase
    FROM (
        SELECT
            w1.phrase, array_agg(w2.unnest) OVER (PARTITION BY w1.row_number ORDER BY w2.row_number)
        FROM words w1
        JOIN words w2
        ON w1.phrase = w2.phrase and w1.row_number <= w2.row_number

        ORDER BY w1.row_number, w2.row_number
    ) s
    WHERE array_length(array_agg, 1) = 3
)
SELECT p.phrase as a, pp.phrase as b, pp.check_phrase 
FROM 
    phrases p 
JOIN 
    phrase_parts pp 
ON p.phrase LIKE '%' || pp.check_phrase || '%' and p.phrase <> pp.phrase

Expanded data set:

phrase
my foo bar baz
foo baz whatever bar
foo bar baz whatever
blah my foo bar blah
blah my foo baz blah

Result:

a                      b                      check_phrase
blah my foo bar blah   my foo bar baz         my foo bar
foo bar baz whatever   my foo bar baz         foo bar baz
my foo bar baz         foo bar baz whatever   foo bar baz
blah my foo baz blah   blah my foo bar blah   blah my foo
my foo bar baz         blah my foo bar blah   my foo bar
blah my foo bar blah   blah my foo baz blah   blah my foo
  1. CTE words creates a list of all words of all phrases. All words are getting an index to ensure the original order within their phrases.

  2. CTE phrase_parts creates all possible 3 word phrases: For every original phrase all words are joined.

After joining the result looks like:

phrase                 unnest   row_number   phrase                 unnest     row_number
my foo bar baz         my       1            my foo bar baz         my         1
my foo bar baz         my       1            my foo bar baz         foo        2
my foo bar baz         my       1            my foo bar baz         bar        3
my foo bar baz         my       1            my foo bar baz         baz        4
my foo bar baz         foo      2            my foo bar baz         foo        2
my foo bar baz         foo      2            my foo bar baz         bar        3
my foo bar baz         foo      2            my foo bar baz         baz        4
my foo bar baz         bar      3            my foo bar baz         bar        3
my foo bar baz         bar      3            my foo bar baz         baz        4
my foo bar baz         baz      4            my foo bar baz         baz        4
foo baz whatever bar   foo      5            foo baz whatever bar   foo        5
foo baz whatever bar   foo      5            foo baz whatever bar   baz        6
foo baz whatever bar   foo      5            foo baz whatever bar   whatever   7
foo baz whatever bar   foo      5            foo baz whatever bar   bar        8
foo baz whatever bar   baz      6            foo baz whatever bar   baz        6
...

With the window function array_agg() I can aggregate the second unnest column this way:

array_agg
{my}
{my,foo}
{my,foo,bar}
{my,foo,bar,baz}
{foo}
{foo,bar}
{foo,bar,baz}
{bar}
{bar,baz}
{baz}
{foo}
{foo,baz}
{foo,baz,whatever}
{foo,baz,whatever,bar}
...

This is filtered for array length = 3 and reconverted into a string. The result are the 3 word phrases:

  1. Last step is to check all phrases in the table for containing any of the 3 word phrases (and that are not the equal to their source phrases)

Upvotes: 1

Amadan
Amadan

Reputation: 198496

Make a table of trigrams-to-row-ids, then self-join on trigram column. Wastes lots of space, but hands down the easiest way to do it. With some help from klin's answer to How to extract n-gram word sequences from text in Postgres:

-- your table
CREATE TABLE phrases (
  id INT,
  phrase TEXT
);

-- your data
INSERT INTO phrases (id, phrase) VALUES
(1, 'my foo bar baz'),
(2, 'foo bar baz whatever'),
(3, 'foo baz whatever bar');

-- function to extract word n-grams
-- from https://stackoverflow.com/a/51571001/240443
CREATE OR REPLACE FUNCTION word_ngrams(str TEXT, n INT)
RETURNS SETOF TEXT LANGUAGE plpgsql AS $$
DECLARE
    i INT;
    arr TEXT[];
BEGIN
    str := regexp_replace(str, '[^[:alnum:]|\s]', '', 'g');
    arr := string_to_array(str, ' ');
    FOR i in 1 .. cardinality(arr) - n + 1 LOOP
        RETURN NEXT array_to_string(arr[i : i+n-1], ' ');
    END LOOP;
END $$;

-- table of all trigrams (my foo bar, foo bar baz, bar baz whatever...)
-- and rows they belong to
CREATE TEMPORARY TABLE trigrams (
  id INT,
  trigram TEXT
);

-- make sure JOIN doesn't take forever
CREATE INDEX ON trigrams (trigram, id);

-- extract the trigrams into their stylish new - yet temporary - home
INSERT INTO trigrams SELECT id, word_ngrams(phrase, 3) FROM phrases;

-- see which original rows have common trigrams
SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM trigrams T1 JOIN trigrams T2
  ON T1.trigram = T2.trigram
  AND T1 < T2;

-- | id1 | id2
---+-----+----
-- |   1 |   2

You can also use the word_ngrams function directly, without a temp table, but it will be much slower. Time or space, pick only one :P This replaces everything in the previous code snippet from CREATE TEMPORARY TABLE onwards (but still uses klin's wonderful function).

SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM phrases T1 JOIN phrases T2
  ON EXISTS (
    SELECT word_ngrams(T1.phrase, 3)
    INTERSECT
    SELECT word_ngrams(T2.phrase, 3)
  )
  AND T1.id < T2.id;

-- | id1 | id2
---+-----+----
-- |   1 |   2

Upvotes: 2

Vishal Raghavan
Vishal Raghavan

Reputation: 483

There might be better options but you can also do something like this. Not exactly the thing you are asking for, but i am sure you will be able to take it forward with this idea.

select n.name from(
select x.name as xname,count(*) from 
(
  (
    select name,unnest(string_to_array(name2,' '))  as name2
                              from new
  )as x
    inner join
    (
        select name,unnest(string_to_array(name,' ')) as name1
         from new
    )as y
    on x.name2=y.name1 and y.id>x.id
) group by x.name having count(*)>=3)r inner join new n on r.xname=n.name

Here's a fiddle for the same:https://www.db-fiddle.com/f/phLirNij577PwEpd8UERef/0

Note i haven't included id in my fiddle but you can do so yourself.

Upvotes: 0

Related Questions