Reputation: 52347
I am using PostgreSQL. I have a table keywords
:
# Table name: keywords
#
# id :integer not null, primary key
# text :string not null
# match_type :string not null
# adgroup_id :integer not null
Table has a uniq index USING btree (match_type, adgroup_id, text)
Now, issue is that for same adgroup_id
and match_type
there are texts like "Hello"
and " Hello"
or "Hello "
or " Hello "
(note the leading/trailing whitespaces).
The issue is that text
column contains those spaces in the beginning and end of string causing bad data (which would not have passed the uniq index without those whitespaces).
I am planning on adding a white-space trimming before insertion in the future, but first I need to clean up the data.
How do I remove the "duplicate" data leaving the unique ones (based on the string comparison without leading and trailing spaces)?
Upvotes: 2
Views: 301
Reputation: 23666
demo:db<>dbfiddle (example contains two groups: "Hello" without an element without whitespace; "Bye" contains two elements without whitespaces)
DELETE FROM keywords
WHERE id NOT IN (
SELECT DISTINCT ON (trim(text)) --1
id
FROM
keywords
ORDER BY
trim(text),
text = trim(text) DESC --2
)
DISTINCT ON
clause. If there is none another element will be takenThe solution containing the additional columns:
DELETE FROM keywords
WHERE id NOT IN (
SELECT DISTINCT ON (match_type, adgroup_id, trim(text))
id
FROM
keywords
ORDER BY
match_type,
adgroup_id,
trim(text),
text = trim(text) DESC
)
Upvotes: 1
Reputation: 520908
Here is one option, using a CTE. The CTE finds all (match_type, adgroup_id)
groups having two or more text
values which are identical with leading and trailing whitespace trimmed. We also compute the following along the way:
(match_type, adgroup_id)
group, starting at value 1
We then delete a row only if it appears within a duplicate group and either it is not the pure version of text (cnt > 0
), or the arbitrary row number is greater than one. This means that for the case "Hello "
and " Hello"
, one of these two records would be arbitrarily deleted. But, if there were a third "pure" record with "Hello"
, then this would be retained and both of the previous two cased would be deleted.
with cte as (
select match_type, adgroup_id, trim(text) as text,
count(case when text = trim(text) then 1 end) as cnt,
row_number() over (partition by match_type, adgroup_id order by trim(text)) rn
from keywords
group by match_type, adgroup_id, trim(text)
having count(*) > 1
)
delete
from keywords k1
where exists (select 1 from cte k2
where k1.match_type = k2.match_type and
k1.adgroup_id = k2.adgroup_id and
k1.text <> k2.text and (k2.cnt > 0 or k2.rn > 1));
Upvotes: 1