Reputation: 1973
I have a column in BigQuery table and I want to do some natural language pre-processing on it. Hence, I want to retain only characters from a-z
and ignore others. I also want to ignore the words in the string that are single characters.
How can I do it best using big-query?
Sample input -
with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all select "none"
)
select * from data
so I have certain stop words like a,am,the,none,na
etc.. which I want to remove from the text, I also dont want to keep retain single characters in the string.
Expected output -
efficacy and or lasting affects whether community protected
''
this questions un clear information
years old
''
The 2nd and 5th data points are blank because they contain stop words.
Upvotes: 1
Views: 1052
Reputation: 172993
Consider below alternative
with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all
select "none"
), stop_words as (
select 'am|the|none|na|is' list
), pattern_to_remove as (
select r'a-zA-Z' remove
)
select trim(regexp_replace(regexp_replace(regexp_replace(
ip, r'[^' || remove || r']', ' '), r'(?:\b)(' || list || r'|[' || remove || r'])(?:\b)', ''), r'[ ]+', ' '
)) as ip
from data, stop_words, pattern_to_remove
looks a little overengineered, but has benefit of having generic query with stop words and to exclude chars to be controlled within the respective CTEs
Upvotes: 1
Reputation: 10172
Try regexp_extract_all to extract all words and then filter them:
with data as (
select "efficacy!! and/or lasting affects whether community protected" as ip union all
select "n/a" as ip union all
select "this questions is un-clear information" union all
select "I m 84-years old!!!" union all
select "none"
)
select STRING_AGG(word, " " ORDER BY offset)
from (
select ip, word, offset
from data, unnest(regexp_extract_all(ip, "[a-z]{2,}")) as word with offset
where word not in ("am", "the", "none", "na", "is")
)
group by ip
Upvotes: 2