Reputation: 126
I have a redshift table with chat messages, such that each row contains: timestamp
, user_id
, msg_body
.
I also have a list of 'bad' words bad_words
containing 200+ words.
Each msg_body
has chat messages comprised of a bunch of words, some of which can be bad. I would like to remove from each msg_body
all occurrences of words in bad_words
and replace with empty strings and save in a new columns new_body
.
Here is some nonworking pseudocode portraying what I want to do:
select timestamp, user_id, mgs_body,
case when (body SIMILAR TO (select distinct words from bad_words)
then (do something like replace(body,badword,'')) end as new_body
from chat_messages
Upvotes: 0
Views: 959
Reputation: 28253
One option would be to use regexp_replace
Construct a pattern that would match all of the bad words like this '\b(very
bad|word|not-good)\b'
using list_agg
WITH re AS (SELECT '\b(' || LISTAGG(DISTINCT words, '|') || ')\b' pattern FROM bad_words)
SELECT
timestamp
, user_id
, body
, REGEXP_REPLACE(body, re.pattern, '') new_body
FROM chat_messages, re
Upvotes: 2