How to remove all stop words and single characters in Bigquery String column

Question

I have a column in BigQuery table and I want to do some natural language pre-processing on it. Hence, I want to retain only characters from a-z and ignore others. I also want to ignore the words in the string that are single characters.

How can I do it best using big-query?

Sample input -

with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all select "none"
)
select * from data

so I have certain stop words like a,am,the,none,na etc.. which I want to remove from the text, I also dont want to keep retain single characters in the string.

Expected output -

efficacy and or lasting affects whether community protected
''
this questions un clear information
years old
''

The 2nd and 5th data points are blank because they contain stop words.

Sergey Geron · Accepted Answer

Try regexp_extract_all to extract all words and then filter them:

with data as (
  select "efficacy!! and/or lasting affects whether community protected" as ip union all
  select "n/a" as ip union all
  select "this questions is un-clear information" union all
  select "I m 84-years old!!!" union all
  select "none"
)
select STRING_AGG(word, " " ORDER BY offset)
from (
  select ip, word, offset
  from data, unnest(regexp_extract_all(ip, "[a-z]{2,}")) as word with offset
  where word not in ("am", "the", "none", "na", "is")
)
group by ip

How to remove all stop words and single characters in Bigquery String column

Answers (2)

Related Questions