Reputation: 287
What I'd like to do is recreate the bigram data from the publicly available trigram data on BigQuery. Along the way, I'd like to trim down the data. It's hard because there seems to be a list of data within a single row; for example, cell.value is a column that contains all the years, and it could have 100 elements in it, all within a single row.
The columns I'd like are something like this:
ngram, first, second, third, cell.match_count (modified)
where the modified last column is the sum of all match counts from 2000-2008 (ignoring all the older data). I suspect this would greatly reduce the size of the file (along with a few other tweaks).
The code I have so far (I have to run 2 separate queries for this) is:
SELECT ngram, cell.value, cell.match_count
FROM [publicdata:samples.trigrams]
WHERE ngram = "I said this"
AND cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
SELECT ngram, SUM(cell.match_count) as total
FROM [one_syllable.test]
GROUP BY ngram
The result is 2 columns with 1 row of data: I said this, 1181
But I'd like to get this for every ngram before I do some more trimming.
How can I combine the queries so it's done all at once and also return the columns first, second, and third?
Thanks!
PS I've tried
SELECT ngram, cell.value, cell.match_count
FROM [publicdata:samples.trigrams]
WHERE cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
But I get an error "response too large to return"...
Upvotes: 2
Views: 295
Reputation: 1
Google BigQuery now exposes the free trigrams dataset with array columns, so the original answer needs to be modified to flatten the array (cell in this case) with the UNNEST function. Modified sample SQL below.
SELECT t1.ngram, t1.first, t1.second, t1.third, SUM(c.match_count) AS total
FROM `bigquery-public-data.samples.trigrams` t1, UNNEST(cell) AS c
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(c.value) AS v
  WHERE v IN ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
)
GROUP BY 1, 2, 3, 4;
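Here the EXISTS subquery keeps a cell whenever any entry in its value array is one of the years 2000-2008, and Standard SQL requires the fully qualified table name to be wrapped in backticks.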
Upvotes: 0
Reputation: 1277
The error "response too large to return" means that you will have to write the results to a destination table, with "Allow Large Results" checked. BigQuery won't return more than 128MB directly without using a destination table.
You should be able to generate the table you want using some aggregation functions. Try "GROUP EACH BY ngram" to aggregate in parallel and use the FIRST function to pick a single value from the first, second and third columns. It would look something like this:
SELECT ngram, FIRST(first), FIRST(second), FIRST(third), SUM(cell.match_count)
FROM [publicdata:samples.trigrams]
WHERE cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
GROUP EACH BY ngram;
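If the end goal is to rebuild bigram counts from those per-trigram totals, you can then group on just the first two words. A rough sketch in the same legacy SQL, assuming the query above was written to a destination table (the table name one_syllable.trigram_totals and the column names first, second, and total are placeholders for whatever aliases you actually use):
-- Placeholder names below; sum the year-filtered totals for every trigram sharing the same first two words
SELECT first, second, SUM(total) AS match_count
FROM [one_syllable.trigram_totals]
GROUP EACH BY first, second;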
Upvotes: 3