Caleb Jay
Caleb Jay

Reputation: 2199

How do I properly "AS" a column that has the "LOWER" function applied, so as to refer to it in a wrapping "SELECT", in BigQuery SQL syntax?

I would like to find the total word count of a given word in the publicly accessible BigQuery database on "shakespeare" (under samples -> shakespeare, also known as bigquery-public-data.samples.shakespeare).

The schema looks like this:

Field name | Type     | Mode      |  Description
---------------------------------------------------
word       | STRING   | REQUIRED  | A single unique word (where whitespace is the delimiter) extracted from a corpus.
word_count | INTEGER  | REQUIRED  | The number of times this word appears in this corpus.
corpus     | STRING   | REQUIRED  | The work from which this word was extracted.
corpus_date| INTEGER  | REQUIRED  | The year in which this corpus was published.

I have successfully combined all words, lowercased, across all values of corpus, and then combined all corpus instances into a new column, found_in, while also SUMing their word counts into the column total_word_count.

My query looks like this:

SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word),
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000

The output columns are

Row     found_in    f0_     total_word_count 

My issue is renaming the f0_ column. This is an issue because I'd like to wrap this whole thing in another query, so I can do something like SELECT * FROM {{that previous query}} WHERE word="thou".

What I don't understand is

  1. How to reference "word," in my WHERE clause.

  2. How to name the LOWER(word) portion of my primary query, as I did for STRING_AGG and (SUM) (using AS).

I tried the following:

SELECT
* 
FROM
(
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word),
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000
)
WHERE word = 'thou'

However, I got an error on the final line: Unrecognized name: word.

So, I tried to use AS:

SELECT
* 
FROM
(
SELECT
  STRING_AGG(DISTINCT corpus) AS found_in,
  LOWER(word) AS lowered_word,
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
GROUP BY
  LOWER(word)
ORDER BY
  total_word_count DESC
LIMIT
  1000
)
WHERE word = 'and'

But then I got the error SELECT list expression references column word which is neither grouped nor aggregated, on the line with LOWER(word).

This is confusing to me, because I see word referenced by the GROUP BY.

How can I properly refer to LOWER(word) so as to refer to it in a secondary querty?

Upvotes: 0

Views: 262

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270713

I think this is what you want:

SELECT * 
FROM (SELECT STRING_AGG(DISTINCT corpus) AS found_in,
             LOWER(word) AS lowered_word,
             SUM(word_count) AS total_word_count
      FROM `bigquery-public-data.samples.shakespeare`
      GROUP BY lowered_word
      ORDER BY total_word_count DESC
      LIMIT 1000
     ) w
WHERE lowered_word = 'and';

Notes:

  • There is nothing called word produced by the subquery, so use lowered_word for the outer comparison.
  • You can aggregate in BigQuery by column aliases.
  • The LIMIT in the subquery seems arbitrary. I don't think it improves performance or lowers the cost of the query.

Upvotes: 2

Related Questions