Regressor
Regressor

Reputation: 1973

how to get all the words that start with a certain character in bigquery

I have a text column in a bigquery table. Sample record of that column looks like -

with temp as 
(
select 1 as id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" as input
union all
select 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)

select *
from temp

I want to extract all the words in this column that start with a # . later on I would like to get the frequency of these terms. How do I do this in BigQuery ?

My output would look like -

id, word
1, united
1, community
2, coronavirus

Upvotes: 0

Views: 1647

Answers (2)

David
David

Reputation: 9721

You can do this without regexes, by splitting words and then selecting ones that start the way you want. For example:

SELECT
  id,
  ARRAY(SELECT TRIM(x, "#") FROM UNNEST(SPLIT(input, ' ')) as x WHERE STARTS_WITH(x,'#')) str
FROM
  temp

If you prefer the hashtags to be separate rows, you can be a bit tiedier:

SELECT
  id,
  TRIM(x, "#") str
FROM
  temp,
  UNNEST(SPLIT(input, ' ')) x
WHERE
  STARTS_WITH(x,'#')

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173028

Below is for BigQuery Standard SQL

I want to extract all the words in this column that start with a #

#standardSQL
WITH temp AS (
  SELECT 1 AS id,"as we go forward into unchartered waters it's important to remember we are all in this together. #united #community" AS input UNION ALL
  SELECT 2 , "US cities close bars, restaurants and cinemas #Coronavirus"
)
SELECT id, word
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word   

with output

Row id  word     
1   1   united   
2   1   community    
3   2   Coronavirus    

later on I would like to get the frequency of these terms

#standardSQL
SELECT word, COUNT(1) frequency
FROM temp, UNNEST(REGEXP_EXTRACT_ALL(input, r'(?:^|\s)#([^#\s]*)')) word
GROUP BY word

Upvotes: 1

Related Questions