Reputation: 1975
I have a table that has this schema:
create table mytable (creation_date timestamp,
value int,
category int);
I want the maximum ocurrence of a value every each hour per category, Only on week days. I had made some progress, I have a query like this now:
select category,foo.h as h,value, count(value) from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value;
as result I got something like this:
category | h | value | count
---------+----------+---------+-------
1 | 00:00:00 | 2 | 1
1 | 01:00:00 | 2 | 1
1 | 02:00:00 | 2 | 6
1 | 03:00:00 | 2 | 31
1 | 03:00:00 | 3 | 11
1 | 04:00:00 | 2 | 21
1 | 04:00:00 | 3 | 9
1 | 13:00:00 | 1 | 14
1 | 14:00:00 | 1 | 10
1 | 14:00:00 | 2 | 7
1 | 15:00:00 | 1 | 52
for example at 04:00 I have to values 2 and 3, with counts of 21 and 9 respectively, I only need the value with highest count which would be the statiscal mode.
BTW I have more than 2M records
Upvotes: 0
Views: 1378
Reputation: 656616
This can be simpler:
SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
category
, extract(hour FROM creation_date)::int AS h
, count(*)::int AS max_ct
, value
FROM mytable
WHERE extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP BY 1,2,4
ORDER BY 1,2,3 DESC;
WHERE ...
). Use ISODOW to simplify the expression.hour
from timestamp
as h
.category
, h
and value
.integer
- we don't need bigint
.category
, h
and the highest count (DESC
).(category, h)
with the according category
.I am able to do this in one query level, because DISTINCT
is applied after the aggregate function.
The result will hold no rows for any (category, h)
without no entries at all. If you need to fill in the blanks, LEFT JOIN
to this:
SELECT c.category, h.h
FROM cat_tbl c
CROSS JOIN (SELECT generate_series(0, 23) AS h) h
Upvotes: 2
Reputation: 7541
Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query on that to finalise the results.
Assuming you called the temporary table "summary_table", the following query should do it.
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s2.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value));
If you don't want to create a table, you could use a WITH clause to attach your query to this one.
with summary_table as (
select category,foo.h as h,value, count(value) as count from mytable, (
select date_trunc('hour',
'2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo
where date_part('hour',creation_date) = date_part('hour',foo.h) and
date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
group by category,h,value)
select
category, h, value, count
from
summary_table s1
where
not exists
(select * from summary_table s2
where s1.category = s1.category and
s1.h = s2.h and
(s1.count < s2.count
OR (s1.count = s2.count and s1.value > s2.value));
Upvotes: 1