Mageshkumar Duraisamy
Mageshkumar Duraisamy

Reputation: 11

Redshift: regexp to remove duplicates within column data

I have the requirement to write a query in redshift database to remove the duplicates within column.

select regexp_replace('GiftCard,GiftCard',  '([^,]*)(,\2)+($|,)', '\2\3')

Expected results : GiftCard

obtained results: GiftCard,GiftCard

Basically, I want to search for the value in the column and remove if repeated.

Can any one help me on this?

Upvotes: 1

Views: 769

Answers (1)

Yoli
Yoli

Reputation: 65

Not sure this can be done using just a regex query but as Jon mentioned a UDF will work quite well.

Just split the text on the comma, create a set of unique words, and return it in some format. The function would be something like:

CREATE FUNCTION f_unique_words (s text)
    RETURNS text
IMMUTABLE
AS $$
    return ','.join(set(s.split(',')))
$$ LANGUAGE plpythonu;

Example usage:

> select f_unique_words('GiftCard,GiftCard');
[GiftCard]
> select f_unique_words('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[frog,GiftCard,Dog,Frog,Cat]

It does depend on you having appropriate access rights to the cluster. To create the function, also make sure you have USAGE granted on language 'plpythonu' for your user.

As a side note if you want a case insensitive version of this that doesn't put all your output in lower case, this would do:

CREATE FUNCTION f_unique_words_ignore_case (s text)
    RETURNS text
IMMUTABLE
AS $$
    wordset = set(s.split(','))
    return ','.join(item for item in wordset if item.istitle() or item.title() not in wordset)
$$ LANGUAGE plpythonu;

Example usage:

> select f_unique_words_ignore_case('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[GiftCard,Dog,Frog,Cat]

Upvotes: 1

Related Questions