hustas88
hustas88

Reputation: 307

postgres find all rows with the same words in a column

Trying to find all rows (and their count) with the same words in a column. The criteria of match is a presence of all words from the string A among all the words of the string B (without paying attention on upper/lower case and words order). E.g. 'lorem ipsum' == 'ipsum LOrem'. I tried the next query:

select t1.title, t1.id, t2.title, t2.id 
from my_table t1
join my_table t2 on t1.id != t2.id and string_to_array(t1.title, ' ') <@ string_to_array(t2.title, ' ');

but it includes just a partial match e.g. 'lorem dolor ipsum' == 'ipsum LOrem' into results.

Upvotes: 1

Views: 1109

Answers (1)

klin
klin

Reputation: 121889

Example table:

drop table if exists my_table;
create table my_table(id int primary key, title text);
insert into my_table values
    (1, 'lorem ipsum'),
    (2, 'IPSUM LOREM'),
    (3, 'lorem dolor ipsum');

Select ordered case-insensitive arrays of words for each id:

select id, title, array_agg(word order by word) as words
from (
    select id, lower(regexp_split_to_table(title, '\s+')) word
    from my_table
    ) s
join my_table using(id)
group by 1, 2;

 id |       title       |        words        
----+-------------------+---------------------
  1 | lorem ipsum       | {ipsum,lorem}
  2 | IPSUM LOREM       | {ipsum,lorem}
  3 | lorem dolor ipsum | {dolor,ipsum,lorem}
(3 rows)

Use this result to find matching pairs using simple equality operator:

with ordered_words as (
    select id, title, array_agg(word order by word) as words
    from (
        select id, lower(regexp_split_to_table(title, '\s+')) word
        from my_table
        ) s
    join my_table using(id)
    group by 1, 2
)

select t1.title, t1.id, t2.title, t2.id 
from ordered_words t1
join ordered_words t2 on t1.id < t2.id and t1.words = t2.words;

    title    | id |    title    | id 
-------------+----+-------------+----
 lorem ipsum |  1 | IPSUM LOREM |  2
(1 row) 

Note, in the join condition I have used < for ids instead of != to eliminate duplicates in reverse order.


Alternative solution - add lower() and symmetric comparison to your query:

select t1.title, t1.id, t2.title, t2.id 
from my_table t1
join my_table t2 
    on t1.id < t2.id 
    and string_to_array(lower(t1.title), ' ') <@ string_to_array(lower(t2.title), ' ')
    and string_to_array(lower(t1.title), ' ') @> string_to_array(lower(t2.title), ' ');

This query looks simpler than the previous one but in fact it is much less efficient.

Upvotes: 2

Related Questions