Finding duplicated values by pattern in postgres sql

Question

is it possible to write a query, which can find duplicates (similar) values by pattern, without spaces between words, only by 3-5 words, all of them lower (upper) case?

I have documents table with many columns, which one of them is 'title'. I need to find documents by title, but title may differ like one with two spaces between words, lover upper case.

Or maybe it can find duplicates similar, where string begins with three - five words

The query:

SELECT title, COUNT(title) 
FROM doc_documents
where not deleted and status ='CONFIRMED'
GROUP BY title
HAVING ( COUNT(title) > 1 )
order by count

Works sort of ok, but it did not find any values which differs with to spaces between word.

Like: 10-12 year classmates, which learns differently" 11 – 12 year classmates, which learns differently

Also is it possible to find only by three words, ignoring spaces and left of the string, like: 10-12 year classmates and 11 – 12 year classmates will be found?

I can't think any of the solutions

joop · Accepted Answer

use a regexp to split the title string into an array of wanted words
implode this array back into a string
group on this string, or us it as a canonical identifier for the fuzzy string
YMMV

 -- sample table and data
CREATE TABLE titles
        ( id serial NOT NULL PRIMARY KEY
        , title text
        );

INSERT INTO titles ( title ) VALUES
  ('10-12 year classmates, which learns differently')
, ('10-12 year classmates, which learns  differently')
, (' 11 – 12 year classmates,  which learns differently');

-- CTE performing the regexp and array magic
WITH tit AS (
        SELECT t.id
        , array_to_string( regexp_split_to_array( btrim(t.title) , E'[^0-9A-Za-z]+'), ' ') AS tit
        , t.title AS org -- you could add a ',' after the 'z' here: ---------- ^
        FROM titles t
        )
    -- Use the CTE to see if it works
SELECT tit
       -- , MIN(org) AS one
       -- , MAX(org) AS two
        , COUNT(*) AS cnt
FROM tit
GROUP BY tit
        ;

Finding duplicated values by pattern in postgres sql

Answers (1)

Related Questions