Andy Davies
Andy Davies

Reputation: 5824

Populating empty bins in a histogram generated using SQL

In Redshift I can create a histogram – in this case it's binning a column named metric into 100ms buckets

select floor(metric / 100) * 100 as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by bin
order by bin

There's a danger that some of the bins might be empty and won't appear in the result set, so I want to use generate_series to create the empty bins e.g.

select *, 0 as impressions from generate_series(0, maxMetricValue, 100) as bin

and union the two sets of results together to produce the 'full' histogram

select bin, sum(impressions)
from
    (
        select floor(metric/100)*100 as bin, count(*) as impressions
        from tablename
        where epoch > date_part(epoch, dateadd(day, -1, sysdate))
        and metric is not null
        group by bin
        order by bin
    )
union
    (
        select *, 0 as impressions from generate_series(0, maxMetricValue, 100) as bin
    )
oroup by bin
order by bin

The challenge is that calculating the maxMetricValue requires a subquery i.e. select max(metric)… etc and I'd like to avoid that

Is there a way I can calculate the max value from the histogram query and use that instead?

Edit:

Something like this seems along the right lines but Redshift doesn't like it

with histogram as (
  select cast(floor(metric/100)*100 as integer) as bin, count(*) as impressions
    from table name
    and epoch > date_part(epoch, dateadd(day, -1, sysdate))
    and metric is not null
    group by bin
    order by bin)
select bin, sum(impressions)
from (
  select * from histogram
  union
  select *, 0 as impressions from generate_series(0, (select max(bin) from histogram), 100) as bin
  )
group by bin
order by bin

I get this error, but there are no INFO messages visible: ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.

If I remove the cast I get: ERROR: function generate_series(integer, double precision, integer) does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts.

If I try using cast or convert in the parameter for generate_series I get the first error again!

Edit 2:

Presume the above query is failing because Redshift is trying to execute generate_series on a compute node rather than a leader but not sure

Upvotes: 2

Views: 545

Answers (1)

Bill Weiner
Bill Weiner

Reputation: 11042

First off generate_series is a leader-only function and will throw an error when used in combo with user data. A recursive CTE is the way to do this but since this isn't what you want I won't get into it.

You could create a numbers table and calculate the min, max and count from the other data you know. You could then outer join on some condition that will never match.

However i expect you will be much better off the union all you already have.

Upvotes: 0

Related Questions