redshift sql remove all words in a given list from sentence strings

Question

I have a redshift table with chat messages, such that each row contains: timestamp, user_id, msg_body.

I also have a list of 'bad' words bad_words containing 200+ words.

Each msg_body has chat messages comprised of a bunch of words, some of which can be bad. I would like to remove from each msg_body all occurrences of words in bad_words and replace with empty strings and save in a new columns new_body.

Here is some nonworking pseudocode portraying what I want to do:

select timestamp, user_id, mgs_body, 
   case when (body SIMILAR TO (select distinct words from bad_words) 
        then (do something like replace(body,badword,'')) end as new_body 
from chat_messages

Haleemur Ali · Accepted Answer

One option would be to use regexp_replace

Construct a pattern that would match all of the bad words like this '\b(very bad|word|not-good)\b' using list_agg

WITH re AS (SELECT '\b(' || LISTAGG(DISTINCT words, '|') || ')\b' pattern FROM bad_words)
SELECT
    timestamp
  , user_id
  , body
  , REGEXP_REPLACE(body, re.pattern, '') new_body
FROM chat_messages, re

redshift sql remove all words in a given list from sentence strings

Answers (1)

Related Questions