Andrey Deineko
Andrey Deineko

Reputation: 52347

Query to remove "duplicate" rows using regexp

I am using PostgreSQL. I have a table keywords:

# Table name: keywords
#
#  id         :integer not null, primary key
#  text       :string  not null
#  match_type :string  not null
#  adgroup_id :integer not null

Table has a uniq index USING btree (match_type, adgroup_id, text)

Now, issue is that for same adgroup_id and match_type there are texts like "Hello" and " Hello" or "Hello " or " Hello " (note the leading/trailing whitespaces). The issue is that text column contains those spaces in the beginning and end of string causing bad data (which would not have passed the uniq index without those whitespaces).

I am planning on adding a white-space trimming before insertion in the future, but first I need to clean up the data.

How do I remove the "duplicate" data leaving the unique ones (based on the string comparison without leading and trailing spaces)?

Upvotes: 2

Views: 301

Answers (2)

S-Man
S-Man

Reputation: 23666

demo:db<>dbfiddle (example contains two groups: "Hello" without an element without whitespace; "Bye" contains two elements without whitespaces)

DELETE FROM keywords
WHERE id NOT IN (
    SELECT DISTINCT ON (trim(text))                 --1
        id
    FROM
        keywords
    ORDER BY 
        trim(text), 
        text = trim(text) DESC                   --2
)
  1. Grouping on trimmed texts.
  2. Order by trimmed texts and the information if the text is the one without whitespace. If there is one element then it will be ordered first and taken by the DISTINCT ON clause. If there is none another element will be taken

The solution containing the additional columns:

    DELETE FROM keywords
    WHERE id NOT IN (
        SELECT DISTINCT ON (match_type, adgroup_id, trim(text))
            id
        FROM
            keywords
        ORDER BY 
            match_type,
            adgroup_id,
            trim(text), 
            text = trim(text) DESC
    )

Upvotes: 1

Tim Biegeleisen
Tim Biegeleisen

Reputation: 520908

Here is one option, using a CTE. The CTE finds all (match_type, adgroup_id) groups having two or more text values which are identical with leading and trailing whitespace trimmed. We also compute the following along the way:

  • cnt - for each group, the number of times that the "pure" version of the text appears. Pure here means that text with no leading or trailing whitespace
  • rn - an arbitrary row number for each (match_type, adgroup_id) group, starting at value 1


We then delete a row only if it appears within a duplicate group and either it is not the pure version of text (cnt > 0), or the arbitrary row number is greater than one. This means that for the case "Hello " and " Hello", one of these two records would be arbitrarily deleted. But, if there were a third "pure" record with "Hello", then this would be retained and both of the previous two cased would be deleted.

with cte as (
    select match_type, adgroup_id, trim(text) as text,
        count(case when text = trim(text) then 1 end) as cnt,
        row_number() over (partition by match_type, adgroup_id order by trim(text)) rn
    from keywords
    group by match_type, adgroup_id, trim(text)
    having count(*) > 1
)

delete
from keywords k1
where exists (select 1 from cte k2
              where k1.match_type = k2.match_type and
                    k1.adgroup_id = k2.adgroup_id and
                    k1.text <> k2.text and (k2.cnt > 0 or k2.rn > 1));

Upvotes: 1

Related Questions