Reputation: 365
I'm having troubles making running totals work for me in BigQuery.
I've found an example that works here: BigQuery SQL running totals
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30
But what I really want to do - is calculate number of most popular words that cover 80% of total word_count. So I tried to calculate running total while ordering by word_count first:
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30
But I get this:
Row word word_count f0_
1 o'er 18 18
2 answer 13 31
3 meet 8 39
4 told 5 44
5 treason 4 **52**
6 quality 4 **52**
7 brave 3 55
The running total is not increased from line 5 to line 6. Probably because in both cases word_count is 4.
What am I doing wrong?
Maybe there is a better way? My plan was to calculate running total. Then divide it by sum(word_count) OVER() and filter only rows with less than 80%. Then count number of these rows.
Upvotes: 2
Views: 2818
Reputation: 59155
First, remove "LIMIT 30" - it will interfere with the OVER() clause.
You want a ratio? Try RATIO_TO_REPORT:
SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a'
You want consecutive rows with equal values to increase anyways? Decide an order for those rows, with a secondary order:
SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a'
You want the most popular words that cover the 80%? Take those ratios, sum them up, and filter out the rest:
SELECT word, word_count, sum_ratio
FROM (
SELECT word, word_count, SUM(ratio) OVER(ORDER BY ratio, word) sum_ratio
FROM (
SELECT word, word_count, RATIO_TO_REPORT(word_count) OVER(ORDER BY word_count DESC, word) ratio
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a'
)
)
WHERE sum_ratio>0.8
Row word word_count sum_ratio
1 is 313 0.8125175752219499
2 it 361 0.827019644076648
3 in 400 0.8430884184308841
4 my 441 0.8608042421564295
5 you 499 0.8808500381633391
6 of 630 0.906158357771261
7 to 635 0.9316675370586108
8 and 706 0.9600289237938375
9 the 995 0.9999999999999999
Upvotes: 3