Tom Fishman

Reputation: 1826

Use Google BigQuery to build a histogram graph

How can I write a query that makes histogram rendering easier?

For example, suppose we have 100 million people with ages and we want to draw the histogram with buckets for ages 0-10, 11-20, 21-30, etc. What does the query look like?

Has anyone done this? Did you try connecting the query result to a Google spreadsheet to draw the histogram?

Upvotes: 20

Views: 42279

Answers (9)

Aleksey Adamovich

Reputation: 39

Take a look at custom SQL functions. It works like this:

to_bin(10, [0, 100, 500]) => '... - 100'
to_bin(1000, [0, 100, 500, 0]) => '500 - ...'
to_bin(1000, [0, 100, 500]) => NULL

Read more here https://github.com/AdamovichAleksey/BigQueryTips/blob/main/sql/functions/to_bins.sql

Any ideas and commits are welcome.
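
For readers who don't want to follow the link, here is a minimal sketch of a bucket-labelling UDF in #standardSQL. It is not the implementation from the linked repo and uses simplified semantics (closed lower bound, open upper bound, NULL outside the boundaries):

CREATE TEMP FUNCTION bin_label(x INT64, bounds ARRAY<INT64>) AS ((
  -- pair each boundary with the next one, then pick the interval that contains x
  SELECT CONCAT(CAST(lo AS STRING), ' - ', CAST(hi AS STRING))
  FROM (
    SELECT b AS lo, LEAD(b) OVER (ORDER BY off) AS hi
    FROM UNNEST(bounds) AS b WITH OFFSET off
  )
  WHERE x >= lo AND x < hi
));

SELECT bin_label(10,   [0, 100, 500]),  -- '0 - 100'
       bin_label(1000, [0, 100, 500]);  -- NULL (above the last boundary)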

Upvotes: 0

Martin Weitzmann

Reputation: 4746

I found gamars' approach quite intriguing and expanded on it a little, using scripting instead of the cross join. Notably, this approach also makes it easy to change group sizes consistently, as here with group sizes that increase exponentially.

declare stats default 
    (select as struct min(new_confirmed) as min, max(new_confirmed) as max 
     from `bigquery-public-data.covid19_open_data.covid19_open_data` 
     where new_confirmed >0 and date = date "2022-03-07"
    );
declare group_amount default 10; -- change group amount here

SELECT 
    CAST(floor(
        (ln(new_confirmed-stats.min+1)/ln(stats.max-stats.min+1)) * (group_amount-1)) 
      AS INT64) group_flag,
    concat('[',min(new_confirmed),',',max(new_confirmed),']') as group_value_range,
    count(1) as quantity
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where new_confirmed >0 and date = date "2022-03-07"
GROUP BY group_flag
ORDER BY group_flag ASC

The basic approach is to label each value with its group_flag and then group by it. The flag is calculated by scaling the value down to a value between 0 and 1 and then scaling it back up to the range 0 to group_amount - 1.

I take the log of the corrected value and of the range before dividing them to get the desired bias in group sizes. I also add 1 to both so the query never tries to take the log of 0.
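
For example (with made-up numbers): if stats.min = 1, stats.max = 10000 and group_amount = 10, a row with new_confirmed = 100 gets group_flag = CAST(FLOOR(LN(100) / LN(10000) * 9) AS INT64) = 4, and the largest row gets flag 9, so every value lands in one of the groups 0 through 9.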

[image: exponential data buckets]

Upvotes: 1

Jordan Tigani

Reputation: 26637

See the 2019 update below, with #standardSQL. --Fh


The subquery idea works, as does using "CASE WHEN" and then doing a GROUP BY:

SELECT COUNT(field1), bucket 
FROM (
    SELECT field1, CASE WHEN age >=  0 AND age < 10 THEN 1
                        WHEN age >= 10 AND age < 20 THEN 2
                        WHEN age >= 20 AND age < 30 THEN 3
                        ...
                        ELSE -1 END as bucket
    FROM table1) 
GROUP BY bucket

Alternatively, if the buckets are regular, you could just divide and cast to an integer:

SELECT COUNT(field1), bucket 
FROM (
    SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
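
In #standardSQL the same regular-bucket idea can be written with integer division; a sketch, assuming the same table1 with an age column:

SELECT bucket, COUNT(field1) AS num
FROM (
    -- DIV does integer division: ages 0-9 -> bucket 0, 10-19 -> bucket 1, etc.
    SELECT field1, DIV(age, 10) AS bucket FROM table1)
GROUP BY bucket
ORDER BY bucket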

Upvotes: 13

Felipe Hoffa

Reputation: 59225

With #standardSQL and an auxiliary stats query, we can define the range the histogram should cover.

Here, for the time to fly between SFO and JFK, with 10 buckets:

WITH data AS ( 
    SELECT *, ActualElapsedTime datapoint
    FROM `fh-bigquery.flights.ontime_201903`
    WHERE FlightDate_year = "2018-01-01" 
    AND Origin = 'SFO' AND Dest = 'JFK'
)
, stats AS (
  SELECT min+step*i min, min+step*(i+1) max
  FROM (
    SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
    FROM (
      SELECT MIN(datapoint) min, MAX(datapoint) max
      FROM data
    )
  ), UNNEST(i) i
)

SELECT COUNT(*) count, (min+max)/2 avg
FROM data 
JOIN stats
ON data.datapoint >= stats.min AND data.datapoint < stats.max
GROUP BY avg
ORDER BY avg

If you need round numbers, see: https://stackoverflow.com/a/60159876/132438


Upvotes: 11

nocibambi

Reputation: 2421

There is now the APPROX_QUANTILES aggregation function in standard SQL.

SELECT
    APPROX_QUANTILES(column, number_of_bins)
...    
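
APPROX_QUANTILES(age, 10) returns an array of 11 approximate boundaries (the minimum, the quantiles, the maximum). A minimal sketch of turning those boundaries into per-bucket counts with RANGE_BUCKET, assuming a people table with an age column:

WITH bounds AS (
  SELECT APPROX_QUANTILES(age, 10) AS qs FROM people
)
SELECT
  RANGE_BUCKET(age, qs) AS bucket,  -- 0-based index of the quantile bucket
  COUNT(*) AS num
FROM people, bounds
GROUP BY bucket
ORDER BY bucket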

Upvotes: 1

Balazs Gunics

Reputation: 2067

Write a subquery like this:

(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0)

Then you can do something like this:

SELECT * FROM
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0),
(SELECT '2' AS agegroup, count(*) FROM people WHERE AGE <= 20 AND AGE >= 11),
(SELECT '3' AS agegroup, count(*) FROM people WHERE AGE <= 120 AND AGE >= 21)

Result will be like:

Row agegroup count 
1   1       somenumber
2   2       somenumber
3   3       another number

I hope this helps. Of course, in the age group you can write anything you like, such as '0 to 10'.
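
Note that this is legacy BigQuery SQL, where a comma between tables means UNION ALL. In #standardSQL the same idea has to be written explicitly; a sketch with the same people table:

SELECT '1' AS agegroup, COUNT(*) AS num FROM people WHERE age BETWEEN 0 AND 10
UNION ALL
SELECT '2', COUNT(*) FROM people WHERE age BETWEEN 11 AND 20
UNION ALL
SELECT '3', COUNT(*) FROM people WHERE age BETWEEN 21 AND 120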

Upvotes: 1

gamars

Reputation: 257

Using a cross join to get your min and max values (not that expensive on a single tuple), you can get a normalized bucket list for any given bucket count:

select
  min(data.VAL) as min,
  max(data.VAL) as max,
  count(data.VAL) as num,
  integer((data.VAL-value.min)/(value.max-value.min)*8) as group
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min FROM [table]) value
GROUP BY group
ORDER BY group 

In this example we're getting 8 buckets (pretty self-explanatory), plus one for null VAL.
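
A rough #standardSQL translation of the same approach, as a sketch (assuming a table mytable with a numeric column val, and max > min so the division is defined):

SELECT
  CAST(FLOOR((data.val - value.min) / (value.max - value.min) * 8) AS INT64) AS grp,
  MIN(data.val) AS min_val,
  MAX(data.val) AS max_val,
  COUNT(data.val) AS num
FROM mytable data
CROSS JOIN (SELECT MIN(val) AS min, MAX(val) AS max FROM mytable) value
GROUP BY grp
ORDER BY grp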

Upvotes: 2

Matt Faus

Reputation: 6681

You could also use the quantiles aggregation operator to get a quick look at the distribution of ages.

SELECT
  quantiles(age, 10)
FROM mytable

Each row of this result corresponds to the age at that point in the sorted list of ages: the first value is the age 1/10th of the way through the sorted list, the second is the age 2/10ths of the way through, and so on.

Upvotes: 21

mdahlman

Reputation: 9400

You're looking for a single vector of information. I would normally query it like this:

select
  count(*) as num,
  integer( age / 10 ) as age_group
from mytable
group by age_group 

A big CASE statement would be needed for arbitrary groups; it would be simple but much longer. My example should be fine as long as every bucket spans the same number of years.

Upvotes: 0
