Regressor
Regressor

Reputation: 1973

How to remove all stop words and single characters in Bigquery String column

I have a column in BigQuery table and I want to do some natural language pre-processing on it. Hence, I want to retain only characters from a-z and ignore others. I also want to ignore the words in the string that are single characters.

How can I do it best using big-query?

Sample input -

with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all select "none"
)
select * from data

so I have certain stop words like a,am,the,none,na etc.. which I want to remove from the text, I also dont want to keep retain single characters in the string.

Expected output -

efficacy and or lasting affects whether community protected
''
this questions un clear information
years old
''

The 2nd and 5th data points are blank because they contain stop words.

Upvotes: 1

Views: 1052

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Consider below alternative

with data as (
  select "efficacy!! and/or lasting affects whether community protected" as ip union all
  select "n/a" as ip union all
  select "this questions is un-clear information" union all
  select "I m 84-years old!!!" union all
  select "none"
), stop_words as (
  select 'am|the|none|na|is' list    
), pattern_to_remove as (
  select r'a-zA-Z' remove
)
  select trim(regexp_replace(regexp_replace(regexp_replace(
      ip, r'[^' || remove || r']', ' '), r'(?:\b)(' || list || r'|[' || remove || r'])(?:\b)', ''), r'[ ]+', ' '
     )) as ip
  from data, stop_words, pattern_to_remove   

looks a little overengineered, but has benefit of having generic query with stop words and to exclude chars to be controlled within the respective CTEs

Upvotes: 1

Sergey Geron
Sergey Geron

Reputation: 10172

Try regexp_extract_all to extract all words and then filter them:

with data as (
  select "efficacy!! and/or lasting affects whether community protected" as ip union all
  select "n/a" as ip union all
  select "this questions is un-clear information" union all
  select "I m 84-years old!!!" union all
  select "none"
)
select STRING_AGG(word, " " ORDER BY offset)
from (
  select ip, word, offset
  from data, unnest(regexp_extract_all(ip, "[a-z]{2,}")) as word with offset
  where word not in ("am", "the", "none", "na", "is")
)
group by ip

enter image description here

Upvotes: 2

Related Questions