Statistical Mode with postgres

Question

I have a table that has this schema:

create table mytable (creation_date timestamp,
                      value int,
                      category int);

I want the maximum ocurrence of a value every each hour per category, Only on week days. I had made some progress, I have a query like this now:

select category,foo.h as h,value, count(value) from mytable, (
      select date_trunc('hour', 
                        '2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo 
      where date_part('hour',creation_date) = date_part('hour',foo.h) and
            date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
      group by category,h,value;

as result I got something like this:

category |    h     |  value  | count 
---------+----------+---------+-------
       1 | 00:00:00 |       2 |     1
       1 | 01:00:00 |       2 |     1
       1 | 02:00:00 |       2 |     6
       1 | 03:00:00 |       2 |    31
       1 | 03:00:00 |       3 |    11
       1 | 04:00:00 |       2 |    21
       1 | 04:00:00 |       3 |     9
       1 | 13:00:00 |       1 |    14
       1 | 14:00:00 |       1 |    10
       1 | 14:00:00 |       2 |     7
       1 | 15:00:00 |       1 |    52

for example at 04:00 I have to values 2 and 3, with counts of 21 and 9 respectively, I only need the value with highest count which would be the statiscal mode.

BTW I have more than 2M records

Erwin Brandstetter · Accepted Answer

This can be simpler:

SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
       category
     , extract(hour FROM creation_date)::int AS h
     , count(*)::int AS max_ct
     , value
FROM   mytable
WHERE  extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP  BY 1,2,4
ORDER  BY 1,2,3 DESC;

Basically these are the steps:

Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
Extract hour from timestamp as h.
Group by category, h and value.
Count the rows per combination of the three; cast to integer - we don't need bigint.
Order by category, h and the highest count (DESC).
Only pick the first row (highest count) per (category, h) with the according category.

I am able to do this in one query level, because DISTINCT is applied after the aggregate function.

The result will hold no rows for any (category, h) without no entries at all. If you need to fill in the blanks, LEFT JOIN to this:

SELECT c.category, h.h
FROM   cat_tbl c
CROSS  JOIN (SELECT generate_series(0, 23) AS h) h

Statistical Mode with postgres

Answers (2)

Basically these are the steps:

Related Questions