Reputation: 59175
I'm looking at reddit comments. I'm using some common stopword lists, but I want to create a custom one for this dataset. How can I do this with SQL?
Upvotes: 0
Views: 425
Reputation: 59175
One approach to identify stopwords is to look at the ones that show up in most documents.
Steps in this query:
r'[a-z]{1,20}\'?[a-z]+'
). Query:
#standardSQL
WITH words_by_post AS (
SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&', '&'), r'&[a-z]{2,4};', '*')
, r'[a-z]{1,20}\'?[a-z]+') words
FROM `fh-bigquery.reddit_comments.2017_07`
WHERE body NOT IN ('[deleted]', '[removed]')
AND subreddit IN ('AskReddit', 'funny', 'movies')
AND score > 100
), words_per_doc AS (
SELECT id, word
FROM words_by_post, UNNEST(words) word
WHERE ARRAY_LENGTH(words) > 20
GROUP BY id, word
)
SELECT word, COUNT(*) docs_with_word
FROM words_per_doc
GROUP BY 1
ORDER BY docs_with_word DESC
LIMIT 100
Upvotes: 1