Reputation: 307
Trying to find all rows (and their count) with the same words in a column. The criteria of match is a presence of all words from the string A among all the words of the string B (without paying attention on upper/lower case and words order). E.g. 'lorem ipsum' == 'ipsum LOrem'. I tried the next query:
select t1.title, t1.id, t2.title, t2.id
from my_table t1
join my_table t2 on t1.id != t2.id and string_to_array(t1.title, ' ') <@ string_to_array(t2.title, ' ');
but it includes just a partial match e.g. 'lorem dolor ipsum' == 'ipsum LOrem' into results.
Upvotes: 1
Views: 1109
Reputation: 121889
Example table:
drop table if exists my_table;
create table my_table(id int primary key, title text);
insert into my_table values
(1, 'lorem ipsum'),
(2, 'IPSUM LOREM'),
(3, 'lorem dolor ipsum');
Select ordered case-insensitive arrays of words for each id:
select id, title, array_agg(word order by word) as words
from (
select id, lower(regexp_split_to_table(title, '\s+')) word
from my_table
) s
join my_table using(id)
group by 1, 2;
id | title | words
----+-------------------+---------------------
1 | lorem ipsum | {ipsum,lorem}
2 | IPSUM LOREM | {ipsum,lorem}
3 | lorem dolor ipsum | {dolor,ipsum,lorem}
(3 rows)
Use this result to find matching pairs using simple equality operator:
with ordered_words as (
select id, title, array_agg(word order by word) as words
from (
select id, lower(regexp_split_to_table(title, '\s+')) word
from my_table
) s
join my_table using(id)
group by 1, 2
)
select t1.title, t1.id, t2.title, t2.id
from ordered_words t1
join ordered_words t2 on t1.id < t2.id and t1.words = t2.words;
title | id | title | id
-------------+----+-------------+----
lorem ipsum | 1 | IPSUM LOREM | 2
(1 row)
Note, in the join condition I have used <
for ids
instead of !=
to eliminate duplicates in reverse order.
Alternative solution - add lower()
and symmetric comparison to your query:
select t1.title, t1.id, t2.title, t2.id
from my_table t1
join my_table t2
on t1.id < t2.id
and string_to_array(lower(t1.title), ' ') <@ string_to_array(lower(t2.title), ' ')
and string_to_array(lower(t1.title), ' ') @> string_to_array(lower(t2.title), ' ');
This query looks simpler than the previous one but in fact it is much less efficient.
Upvotes: 2