Reputation: 73
is it possible to write a query, which can find duplicates (similar) values by pattern, without spaces between words, only by 3-5 words, all of them lower (upper) case?
I have documents table with many columns, which one of them is 'title'. I need to find documents by title, but title may differ like one with two spaces between words, lover upper case.
Or maybe it can find duplicates similar, where string begins with three - five words
The query:
SELECT title, COUNT(title)
FROM doc_documents
where not deleted and status ='CONFIRMED'
GROUP BY title
HAVING ( COUNT(title) > 1 )
order by count
Works sort of ok, but it did not find any values which differs with to spaces between word.
Like:
10-12 year classmates, which learns differently"
11 – 12 year classmates, which learns differently
Also is it possible to find only by three words, ignoring spaces and left of the string, like:
10-12 year classmates
and 11 – 12 year classmates
will be found?
I can't think any of the solutions
Upvotes: 1
Views: 1039
Reputation: 4503
-- sample table and data
CREATE TABLE titles
( id serial NOT NULL PRIMARY KEY
, title text
);
INSERT INTO titles ( title ) VALUES
('10-12 year classmates, which learns differently')
, ('10-12 year classmates, which learns differently')
, (' 11 – 12 year classmates, which learns differently');
-- CTE performing the regexp and array magic
WITH tit AS (
SELECT t.id
, array_to_string( regexp_split_to_array( btrim(t.title) , E'[^0-9A-Za-z]+'), ' ') AS tit
, t.title AS org -- you could add a ',' after the 'z' here: ---------- ^
FROM titles t
)
-- Use the CTE to see if it works
SELECT tit
-- , MIN(org) AS one
-- , MAX(org) AS two
, COUNT(*) AS cnt
FROM tit
GROUP BY tit
;
Upvotes: 1