Gumuliokas
Gumuliokas

Reputation: 73

Finding duplicated values by pattern in postgres sql

is it possible to write a query, which can find duplicates (similar) values by pattern, without spaces between words, only by 3-5 words, all of them lower (upper) case?

I have documents table with many columns, which one of them is 'title'. I need to find documents by title, but title may differ like one with two spaces between words, lover upper case.

Or maybe it can find duplicates similar, where string begins with three - five words

The query:

SELECT title, COUNT(title) 
FROM doc_documents
where not deleted and status ='CONFIRMED'
GROUP BY title
HAVING ( COUNT(title) > 1 )
order by count

Works sort of ok, but it did not find any values which differs with to spaces between word.

Like: 10-12 year classmates, which learns differently" 11 – 12 year classmates, which learns differently

Also is it possible to find only by three words, ignoring spaces and left of the string, like: 10-12 year classmates and 11 – 12 year classmates will be found?

I can't think any of the solutions

Upvotes: 1

Views: 1039

Answers (1)

joop
joop

Reputation: 4503

  • use a regexp to split the title string into an array of wanted words
  • implode this array back into a string
  • group on this string, or us it as a canonical identifier for the fuzzy string
  • YMMV

 -- sample table and data
CREATE TABLE titles
        ( id serial NOT NULL PRIMARY KEY
        , title text
        );

INSERT INTO titles ( title ) VALUES
  ('10-12 year classmates, which learns differently')
, ('10-12 year classmates, which learns  differently')
, (' 11 – 12 year classmates,  which learns differently');

-- CTE performing the regexp and array magic
WITH tit AS (
        SELECT t.id
        , array_to_string( regexp_split_to_array( btrim(t.title) , E'[^0-9A-Za-z]+'), ' ') AS tit
        , t.title AS org -- you could add a ',' after the 'z' here: ---------- ^
        FROM titles t
        )
    -- Use the CTE to see if it works
SELECT tit
       -- , MIN(org) AS one
       -- , MAX(org) AS two
        , COUNT(*) AS cnt
FROM tit
GROUP BY tit
        ;

Upvotes: 1

Related Questions