Reputation: 51
I have a large list of comments from reddit. The strings are split into words, punctuation removed, and quantified to show the most used words on a particular subreddit:
SELECT word, COUNT(*) as num_words
FROM(FLATTEN((
SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ') word
FROM [fh-bigquery:reddit_comments.2017_08]
WHERE subreddit="The_Donald"
), word))
GROUP EACH BY word
HAVING num_words >= 1000
ORDER BY num_words DESC
I have a list of stopwords to remove, how would I go about adding that to the code? Thanks :)
Upvotes: 5
Views: 2037
Reputation: 172993
Below example is for BigQuery Legacy SQL (as it is in your question)
#legacydSQL
SELECT word, COUNT(*) AS num_words
FROM(FLATTEN((
SELECT SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ') word
FROM [fh-bigquery:reddit_comments.2017_08]
WHERE subreddit="The_Donald"
), word))
WHERE NOT word IN (
'the','to','a','and'
)
GROUP EACH BY word
HAVING num_words >= 1000
ORDER BY num_words DESC
BigQuery Team strongly recommends using Standard SQL
So if you will decide to migrate
- below is example in Standard SQL
It assumes you have your stop words in your_project.your_dataset.stop_words
table
#standardSQL
SELECT word, COUNT(*) AS num_words
FROM `fh-bigquery.reddit_comments.2017_08`,
UNNEST(SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ')) word
WHERE subreddit="The_Donald"
AND word NOT IN (SELECT stop_word FROM `your_project.your_dataset.stop_words`)
GROUP BY word
HAVING num_words >= 1000
AND word != ''
ORDER BY num_words DESC
You can test / play with below dummy data here
#standardSQL
WITH `your_project.your_dataset.stop_words` AS (
SELECT stop_word
FROM UNNEST(['the','to','a','and']) stop_word
)
SELECT word, COUNT(*) AS num_words
FROM `fh-bigquery.reddit_comments.2017_08`,
UNNEST(SPLIT(LOWER(REGEXP_REPLACE(body, r'[\.\",*:()\[\]/|\n]', ' ')), ' ')) word
WHERE subreddit="The_Donald"
AND word NOT IN (SELECT stop_word FROM `your_project.your_dataset.stop_words`)
GROUP BY word
HAVING num_words >= 1000
AND word != ''
ORDER BY num_words DESC
Upvotes: 5