Reputation: 11
I have the requirement to write a query in redshift database to remove the duplicates within column.
select regexp_replace('GiftCard,GiftCard', '([^,]*)(,\2)+($|,)', '\2\3')
Expected results : GiftCard
obtained results: GiftCard,GiftCard
Basically, I want to search for the value in the column and remove if repeated.
Can any one help me on this?
Upvotes: 1
Views: 769
Reputation: 65
Not sure this can be done using just a regex query but as Jon mentioned a UDF will work quite well.
Just split the text on the comma, create a set of unique words, and return it in some format. The function would be something like:
CREATE FUNCTION f_unique_words (s text)
RETURNS text
IMMUTABLE
AS $$
return ','.join(set(s.split(',')))
$$ LANGUAGE plpythonu;
Example usage:
> select f_unique_words('GiftCard,GiftCard');
[GiftCard]
> select f_unique_words('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[frog,GiftCard,Dog,Frog,Cat]
It does depend on you having appropriate access rights to the cluster. To create the function, also make sure you have USAGE granted on language 'plpythonu' for your user.
As a side note if you want a case insensitive version of this that doesn't put all your output in lower case, this would do:
CREATE FUNCTION f_unique_words_ignore_case (s text)
RETURNS text
IMMUTABLE
AS $$
wordset = set(s.split(','))
return ','.join(item for item in wordset if item.istitle() or item.title() not in wordset)
$$ LANGUAGE plpythonu;
Example usage:
> select f_unique_words_ignore_case('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[GiftCard,Dog,Frog,Cat]
Upvotes: 1