Reputation: 33
Is there any way in Bigquery to group by not the absolute value but a range of values?
I have a query that looks in a product table with 4 different numeric group by's. What I am looking for is an efficient way to group by in a way like: group by "A±1000" etc. or "A±10%ofA".
thanks in advance,
Upvotes: 1
Views: 6667
Reputation: 173046
Below example is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.example` AS (
SELECT * FROM
UNNEST([STRUCT<id INT64, price FLOAT64>
(1, 15), (2, 50), (3, 125), (4, 150), (5, 175), (6, 250)
])
)
SELECT
CASE
WHEN price > 0 AND price <= 100 THEN ' 0 - 100'
WHEN price > 100 AND price <= 200 THEN '100 - 200'
ELSE '200+'
END AS range_group,
COUNT(1) AS cnt
FROM `project.dataset.example`
GROUP BY range_group
-- ORDER BY range_group
with result
Row range_group cnt
1 0 - 100 2
2 100 - 200 3
3 200+ 1
As you can see, in above solution you need construct CASE statement to reflect your ranges - if you have multiple - this can be quite boring - so below is more generic (but more verbose) solution - and it uses recently introduced RANGE_BUCKET function
#standardSQL
WITH `project.dataset.example` AS (
SELECT * FROM
UNNEST([STRUCT<id INT64, price FLOAT64>
(1, 15), (2, 50), (3, 125), (4, 150), (5, 175), (6, 250)
])
), ranges AS (
SELECT [100.0, 200.0] ranges_array
), temp AS (
SELECT OFFSET, IF(prev_val = val, CONCAT(prev_val, ' - '), CONCAT(prev_val, ' - ', val)) rng FROM (
SELECT OFFSET, IFNULL(CAST(LAG(val) OVER(ORDER BY OFFSET) AS STRING), '') prev_val, CAST(val AS STRING) AS val
FROM ranges, UNNEST(ARRAY_CONCAT(ranges_array, [ARRAY_REVERSE(ranges_array)[OFFSET(0)]])) val WITH OFFSET
)
)
SELECT
RANGE_BUCKET(price, ranges_array) range_group,
rng,
COUNT(1) AS cnt
FROM `project.dataset.example`, ranges
JOIN temp ON RANGE_BUCKET(price, ranges_array) = OFFSET
GROUP BY range_group, rng
-- ORDER BY range_group
with result
Row range_group rng cnt
1 0 - 100 2
2 1 100 - 200 3
3 2 200 - 1
As you can see, in second solution you need to define your your ranges in ranges
as simple array enlisting your boundaries as SELECT [100.0, 200.0] ranges_array
Then temp
does all needed calculation
Upvotes: 3
Reputation: 59225
You can do math operations on the GROUP BY
, creating groups by any arbitrary criteria.
For example:
WITH data AS (
SELECT repo.name, COUNT(*) price
FROM `githubarchive.month.201909`
GROUP BY 1
HAVING price>100
)
SELECT FORMAT('range %i-%i', MIN(price), MAX(price)) price_range, COUNT(*) c
FROM data
GROUP BY CAST(LOG(price) AS INT64)
ORDER BY MIN(price)
Upvotes: 0
Reputation: 5518
You can generate a column as a "named range" then group by the column. As an example for your A+-1000
case:
with data as (
select 100 as v union all
select 200 union all
select 2000 union all
select 2100 union all
select 2200 union all
select 4100 union all
select 8000 union all
select 8000
)
select count(v), ARRAY_AGG(v), ranges
FROM data, unnest([0, 2000, 4000, 6000, 8000]) ranges
WHERE data.v >= ranges - 1000 AND data.v < ranges + 1000
GROUP BY ranges
Output:
+-----+------------------------+--------+
| f0_ | f1_ | ranges |
+-----+------------------------+--------+
| 2 | ["100","200"] | 0 |
| 3 | ["2000","2100","2200"] | 2000 |
| 1 | ["4100"] | 4000 |
| 2 | ["8000","8000"] | 8000 |
+-----+------------------------+--------+
Upvotes: 5