Nikolay Novozhlov
Nikolay Novozhlov

Reputation: 365

BigQuery running totals

I'm having troubles making running totals work for me in BigQuery.

I've found an example that works here: BigQuery SQL running totals

SELECT word, word_count, SUM(word_count) OVER(ORDER BY word DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' LIMIT 30

But what I really want to do - is calculate number of most popular words that cover 80% of total word_count. So I tried to calculate running total while ordering by word_count first:

SELECT word, word_count, SUM(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' LIMIT 30

But I get this:

Row word    word_count  f0_  
1   o'er    18          18   
2   answer  13          31   
3   meet    8           39   
4   told    5           44   
5   treason 4           **52**   
6   quality 4           **52**   
7   brave   3           55  

The running total is not increased from line 5 to line 6. Probably because in both cases word_count is 4.

What am I doing wrong?

Maybe there is a better way? My plan was to calculate running total. Then divide it by sum(word_count) OVER() and filter only rows with less than 80%. Then count number of these rows.

Upvotes: 2

Views: 2818

Answers (1)

Felipe Hoffa
Felipe Hoffa

Reputation: 59155

First, remove "LIMIT 30" - it will interfere with the OVER() clause.

You want a ratio? Try RATIO_TO_REPORT:

SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' 

You want consecutive rows with equal values to increase anyways? Decide an order for those rows, with a secondary order:

SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word)
FROM [publicdata:samples.shakespeare]
WHERE corpus  = 'hamlet'
AND word > 'a' 

You want the most popular words that cover the 80%? Take those ratios, sum them up, and filter out the rest:

SELECT word, word_count, sum_ratio
FROM (
 SELECT word, word_count, SUM(ratio) OVER(ORDER BY ratio, word) sum_ratio
 FROM (
    SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word) ratio
    FROM [publicdata:samples.shakespeare]
    WHERE corpus  = 'hamlet'
    AND word > 'a' 
 )
)
WHERE sum_ratio>0.8

Row word    word_count  sum_ratio    
1   is      313         0.8125175752219499   
2   it      361         0.827019644076648    
3   in      400         0.8430884184308841   
4   my      441         0.8608042421564295   
5   you     499         0.8808500381633391   
6   of      630         0.906158357771261    
7   to      635         0.9316675370586108   
8   and     706         0.9600289237938375   
9   the     995         0.9999999999999999  

Upvotes: 3

Related Questions