Reputation: 1973
I have a text column in a bigquery table. Sample record of that column looks like -
with temp as
(
select 1 as id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" as input
union all
select 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
select *
from temp
I want to extract all the words in this column that start with a #
. later on I would like to get the frequency of these terms. How do I do this in BigQuery ?
My output would look like -
id, word
1, united
1, community
2, coronavirus
Upvotes: 0
Views: 1647
Reputation: 9721
You can do this without regexes, by splitting words and then selecting ones that start the way you want. For example:
SELECT
id,
ARRAY(SELECT TRIM(x, "#") FROM UNNEST(SPLIT(input, ' ')) as x WHERE STARTS_WITH(x,'#')) str
FROM
temp
If you prefer the hashtags to be separate rows, you can be a bit tiedier:
SELECT
id,
TRIM(x, "#") str
FROM
temp,
UNNEST(SPLIT(input, ' ')) x
WHERE
STARTS_WITH(x,'#')
Upvotes: 0
Reputation: 173028
Below is for BigQuery Standard SQL
I want to extract all the words in this column that start with a #
#standardSQL
WITH temp AS (
SELECT 1 AS id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
SELECT 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT id, word
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
with output
Row id word
1 1 united
2 1 community
3 2 Coronavirus
later on I would like to get the frequency of these terms
#standardSQL
SELECT word, COUNT(1) frequency
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
GROUP BY word
Upvotes: 1