Paco Valdez
Paco Valdez

Reputation: 1975

Statistical Mode with postgres

I have a table that has this schema:

create table mytable (creation_date timestamp,
                      value int,
                      category int);

I want the maximum ocurrence of a value every each hour per category, Only on week days. I had made some progress, I have a query like this now:

select category,foo.h as h,value, count(value) from mytable, (
      select date_trunc('hour', 
                        '2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo 
      where date_part('hour',creation_date) = date_part('hour',foo.h) and
            date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
      group by category,h,value;

as result I got something like this:

category |    h     |  value  | count 
---------+----------+---------+-------
       1 | 00:00:00 |       2 |     1
       1 | 01:00:00 |       2 |     1
       1 | 02:00:00 |       2 |     6
       1 | 03:00:00 |       2 |    31
       1 | 03:00:00 |       3 |    11
       1 | 04:00:00 |       2 |    21
       1 | 04:00:00 |       3 |     9
       1 | 13:00:00 |       1 |    14
       1 | 14:00:00 |       1 |    10
       1 | 14:00:00 |       2 |     7
       1 | 15:00:00 |       1 |    52

for example at 04:00 I have to values 2 and 3, with counts of 21 and 9 respectively, I only need the value with highest count which would be the statiscal mode.

BTW I have more than 2M records

Upvotes: 0

Views: 1378

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656616

This can be simpler:

SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
       category
     , extract(hour FROM creation_date)::int AS h
     , count(*)::int AS max_ct
     , value
FROM   mytable
WHERE  extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP  BY 1,2,4
ORDER  BY 1,2,3 DESC;

Basically these are the steps:

  1. Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
  2. Extract hour from timestamp as h.
  3. Group by category, h and value.
  4. Count the rows per combination of the three; cast to integer - we don't need bigint.
  5. Order by category, h and the highest count (DESC).
  6. Only pick the first row (highest count) per (category, h) with the according category.

I am able to do this in one query level, because DISTINCT is applied after the aggregate function.

The result will hold no rows for any (category, h) without no entries at all. If you need to fill in the blanks, LEFT JOIN to this:

SELECT c.category, h.h
FROM   cat_tbl c
CROSS  JOIN (SELECT generate_series(0, 23) AS h) h

Upvotes: 2

Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query on that to finalise the results.

Assuming you called the temporary table "summary_table", the following query should do it.

select
    category, h, value, count
from
    summary_table s1
where
    not exists 
    (select * from summary_table s2
        where s1.category = s2.category and
                  s1.h = s2.h and
                   (s1.count < s2.count 
            OR (s1.count = s2.count and s1.value > s2.value));

If you don't want to create a table, you could use a WITH clause to attach your query to this one.

with summary_table as (
select category,foo.h as h,value, count(value) as count from mytable, (
      select date_trunc('hour', 
                        '2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo 
      where date_part('hour',creation_date) = date_part('hour',foo.h) and
            date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
      group by category,h,value)
select
    category, h, value, count
from
    summary_table s1
where
    not exists 
    (select * from summary_table s2
        where s1.category = s1.category and
                  s1.h = s2.h and
                   (s1.count < s2.count 
            OR (s1.count = s2.count and s1.value > s2.value));

Upvotes: 1

Related Questions