Reputation: 2199
I would like to find the total word count of a given word in the publicly accessible BigQuery database on "shakespeare" (under samples -> shakespeare, also known as bigquery-public-data.samples.shakespeare
).
The schema looks like this:
Field name | Type | Mode | Description
---------------------------------------------------
word | STRING | REQUIRED | A single unique word (where whitespace is the delimiter) extracted from a corpus.
word_count | INTEGER | REQUIRED | The number of times this word appears in this corpus.
corpus | STRING | REQUIRED | The work from which this word was extracted.
corpus_date| INTEGER | REQUIRED | The year in which this corpus was published.
I have successfully combined all words, lowercased, across all values of corpus
, and then combined all corpus
instances into a new column, found_in
, while also SUM
ing their word counts into the column total_word_count
.
My query looks like this:
SELECT
STRING_AGG(DISTINCT corpus) AS found_in,
LOWER(word),
SUM(word_count) AS total_word_count
FROM
`bigquery-public-data.samples.shakespeare`
GROUP BY
LOWER(word)
ORDER BY
total_word_count DESC
LIMIT
1000
The output columns are
Row found_in f0_ total_word_count
My issue is renaming the f0_
column. This is an issue because I'd like to wrap this whole thing in another query, so I can do something like SELECT * FROM {{that previous query}} WHERE word="thou"
.
What I don't understand is
How to reference "word," in my WHERE
clause.
How to name the LOWER(word)
portion of my primary query, as I did for STRING_AGG
and (SUM) (using AS
).
I tried the following:
SELECT
*
FROM
(
SELECT
STRING_AGG(DISTINCT corpus) AS found_in,
LOWER(word),
SUM(word_count) AS total_word_count
FROM
`bigquery-public-data.samples.shakespeare`
GROUP BY
LOWER(word)
ORDER BY
total_word_count DESC
LIMIT
1000
)
WHERE word = 'thou'
However, I got an error on the final line: Unrecognized name: word
.
So, I tried to use AS
:
SELECT
*
FROM
(
SELECT
STRING_AGG(DISTINCT corpus) AS found_in,
LOWER(word) AS lowered_word,
SUM(word_count) AS total_word_count
FROM
`bigquery-public-data.samples.shakespeare`
GROUP BY
LOWER(word)
ORDER BY
total_word_count DESC
LIMIT
1000
)
WHERE word = 'and'
But then I got the error SELECT list expression references column word which is neither grouped nor aggregated
, on the line with LOWER(word)
.
This is confusing to me, because I see word
referenced by the GROUP BY
.
How can I properly refer to LOWER(word) so as to refer to it in a secondary querty?
Upvotes: 0
Views: 262
Reputation: 1270713
I think this is what you want:
SELECT *
FROM (SELECT STRING_AGG(DISTINCT corpus) AS found_in,
LOWER(word) AS lowered_word,
SUM(word_count) AS total_word_count
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY lowered_word
ORDER BY total_word_count DESC
LIMIT 1000
) w
WHERE lowered_word = 'and';
Notes:
word
produced by the subquery, so use lowered_word
for the outer comparison.LIMIT
in the subquery seems arbitrary. I don't think it improves performance or lowers the cost of the query.Upvotes: 2