DaHoopster
DaHoopster

Reputation: 602

What's the best way to get the IDs of rows by percentile for bigquery?

I am working with the public dataset of wikipedia in BigQuery. I am trying to find the ids of the wikipedia articles with character count ranked in the 75th, 80th, 85th and 90th percentile.

I am now aware of the QUANTILES function that returns the quantiles, but how can I retrieve all the IDs?

Upvotes: 1

Views: 1043

Answers (1)

Felipe Hoffa
Felipe Hoffa

Reputation: 59175

How about:

SELECT a.id 
FROM [publicdata:samples.wikipedia] a
CROSS JOIN (
 SELECT NTH(75, QUANTILES(num_characters, 100)) amin, NTH(76, QUANTILES(num_characters, 100)) amax
 FROM [publicdata:samples.wikipedia]) b
WHERE a.num_characters > b.amin AND a.num_characters < b.amax;

What are you doing with the ids in the 76th, 86th, etc., percentiles? You should be able to adapt the previous query depending on your exact needs.

(For more efficiency you should extract the QUANTILES query results to a new table instead of computing it each time - queries will run faster and cheaper)

Upvotes: 1

Related Questions