thatandrey
thatandrey

Reputation: 287

Sort public trigram data in BigQuery

what I'd like to do is recreate the bigram data from the available/public trigram data on BigQuery. Along the way, I'd like to trim down the data. It's hard because there seems to be a list of data within a single row, for example, cell.value is a column name that contains all the years, and it could have 100 elements in it, and all of that is in one row.

The columns I'd like are something like this:

ngram, first, second, third, cell.match_count*modified

where the modified last column is the sum of all match counts from 2000-2008 (ignoring all the older data). I suspect this would greatly reduce the size of the file (along with a few other tweaks).

The code I have so far is (and I have to run 2 separate queries for this)

SELECT ngram, cell.value, cell.match_count
FROM [publicdata:samples.trigrams]
WHERE ngram = "I said this"
AND cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")

SELECT ngram, SUM(cell.match_count) as total
FROM [one_syllable.test]
GROUP BY ngram

The result is 2 columns with 1 row of data: I said this, 1181

But I'd like to get this for every ngram before I do some more trimming

How can I combine the queries so it's done all at once and also return the columns first, second, and third ?

Thanks!

PS I've tried

SELECT ngram, cell.value, cell.match_count
FROM [publicdata:samples.trigrams]
WHERE cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")

But I get an error "response too large to return"...

Upvotes: 2

Views: 295

Answers (2)

LUFC999
LUFC999

Reputation: 1

Google BIGQUERY now has arrays on the free trigrams dataset and the original answer needs to be modified to flatten the array (cell in this case) by using the UNNEST function. Modified sample SQL code below.

SELECT t1.ngram, t1.first, t1.second, t1.third, SUM(c.match_count)
from bigquery-public-data.samples.trigrams t1, UNNEST(cell) as c
WHERE {"2000","2001","2002","2003","2004","2005","2006","2007","2008"} IN 
UNNEST(c.value)
GROUP BY 1,2,3,4;

Upvotes: 0

sprocket
sprocket

Reputation: 1277

The error "response too large to return" means that you will have to write the results to a destination table, with "Allow Large Results" checked. BigQuery won't return more than 128MB directly without using a destination table.

You should be able to generate the table you want using some aggregation functions. Try "GROUP EACH BY ngram" to aggregate in parallel and use the FIRST function to pick a single value from the first, second and third columns. It would look something like this:

SELECT ngram, FIRST(first), FIRST(second), FIRST(third), SUM(cell.match_count)
FROM [publicdata:samples.trigrams]
WHERE cell.value in ("2000","2001","2002","2003","2004","2005","2006","2007","2008")
GROUP EACH BY ngram;

Upvotes: 3

Related Questions