Reputation: 2558
I have a PostgreSQL database on which I am trying to summarize the revenue of a cash register over time. The cash register can either have status ACTIVE or INACTIVE, but I only want to summarize the earnings created when it was ACTIVE for a given period of time.
I have two tables; one that marks the revenue and one that marks the cash register status:
CREATE TABLE counters
(
id bigserial NOT NULL,
"timestamp" timestamp with time zone,
total_revenue bigint,
id_of_machine character varying(50),
CONSTRAINT counters_pkey PRIMARY KEY (id)
)
CREATE TABLE machine_lifecycle_events
(
id bigserial NOT NULL,
event_type character varying(50),
"timestamp" timestamp with time zone,
id_of_affected_machine character varying(50),
CONSTRAINT machine_lifecycle_events_pkey PRIMARY KEY (id)
)
A counters entry is added every 1 minute and total_revenue only increases. A machine_lifecycle_events entry is added every time the status of the machine changes.
I have added an image illustrating the problem. It is the revenue during the blue periods which should be summarized.
I have created a query which can give me the total revenue in a given instant:
SELECT total_revenue
FROM counters
WHERE timestamp < '2014-03-05 11:00:00'
AND id_of_machine='1'
ORDER BY
timestamp desc
LIMIT 1
Any ideas on how to attack this problem?
Example data:
INSERT INTO counters VALUES
(1, '2014-03-01 00:00:00', 100, '1')
, (2, '2014-03-01 12:00:00', 200, '1')
, (3, '2014-03-02 00:00:00', 300, '1')
, (4, '2014-03-02 12:00:00', 400, '1')
, (5, '2014-03-03 00:00:00', 500, '1')
, (6, '2014-03-03 12:00:00', 600, '1')
, (7, '2014-03-04 00:00:00', 700, '1')
, (8, '2014-03-04 12:00:00', 800, '1')
, (9, '2014-03-05 00:00:00', 900, '1')
, (10, '2014-03-05 12:00:00', 1000, '1')
, (11, '2014-03-06 00:00:00', 1100, '1')
, (12, '2014-03-06 12:00:00', 1200, '1')
, (13, '2014-03-07 00:00:00', 1300, '1')
, (14, '2014-03-07 12:00:00', 1400, '1');
INSERT INTO machine_lifecycle_events VALUES
(1, 'ACTIVE', '2014-03-01 08:00:00', '1')
, (2, 'INACTIVE', '2014-03-03 00:00:00', '1')
, (3, 'ACTIVE', '2014-03-05 00:00:00', '1')
, (4, 'INACTIVE', '2014-03-06 12:00:00', '1');
Example query:
The revenue between '2014-03-02 08:00:00' and '2014-03-06 08:00:00' is 300. 100 for the first ACTIVE period, and 200 for the second ACTIVE period.
Upvotes: 7
Views: 1802
Reputation: 656471
To make my work easier I sanitized your DB design before I tackled the questions:
CREATE TEMP TABLE counter (
id bigserial PRIMARY KEY
, ts timestamp NOT NULL
, total_revenue bigint NOT NULL
, machine_id int NOT NULL
);
CREATE TEMP TABLE machine_event (
id bigserial PRIMARY KEY
, ts timestamp NOT NULL
, machine_id int NOT NULL
, status_active bool NOT NULL
);
ts
instead of "timestamp". Never use basic type names as column names.machine_id
and made it out to be integer
as it should be, instead of varchar(50)
.event_type varchar(50)
should be an integer
foreign key, too, or an enum
. Or even just a boolean
for only active / inactive. Simplified to status_active bool
.INSERT
statements as well.total_revenue only increases
(per question).machine_event
has the opposite status_active
.1. How do I calculate the revenue earned between two timestamps?
WITH span AS (
SELECT '2014-03-02 12:00'::timestamp AS s_from -- start of time range
, '2014-03-05 11:00'::timestamp AS s_to -- end of time range
)
SELECT machine_id, s.s_from, s.s_to
, max(total_revenue) - min(total_revenue) AS earned
FROM counter c
, span s
WHERE ts BETWEEN s_from AND s_to -- borders included!
AND machine_id = 1
GROUP BY 1,2,3;
2. How do I determine the start and end timestamps of the blue periods when I have to compare the timestamps in
machine_event
with the input period?
This query for all machines in the given time frame (span
).
Add WHERE machine_id = 1
in the CTE cte
to select a specific machine.
WITH span AS (
SELECT '2014-03-02 08:00'::timestamp AS s_from -- start of time range
, '2014-03-06 08:00'::timestamp AS s_to -- end of time range
)
, cte AS (
SELECT machine_id, ts, status_active, s_from
, lead(ts, 1, s_to) OVER w AS period_end
, first_value(ts) OVER w AS first_ts
FROM span s
JOIN machine_event e ON e.ts BETWEEN s.s_from AND s.s_to
WINDOW w AS (PARTITION BY machine_id ORDER BY ts)
)
SELECT machine_id, ts AS period_start, period_end -- start in time frame
FROM cte
WHERE status_active
UNION ALL -- active start before time frame
SELECT machine_id, s_from, ts
FROM cte
WHERE NOT status_active
AND ts = first_ts
AND ts <> s_from
UNION ALL -- active start before time frame, no end in time frame
SELECT machine_id, s_from, s_to
FROM (
SELECT DISTINCT ON (1)
e.machine_id, e.status_active, s.s_from, s.s_to
FROM span s
JOIN machine_event e ON e.ts < s.s_from -- only from before time range
LEFT JOIN cte c USING (machine_id)
WHERE c.machine_id IS NULL -- not in selected time range
ORDER BY e.machine_id, e.ts DESC -- only the latest entry
) sub
WHERE status_active -- only if active
ORDER BY 1, 2;
Result is the list of blue periods in your image.
SQL Fiddle demonstrating both.
Recent similar question:
Sum of time difference between rows
Upvotes: 2
Reputation: 247
Use self-join and build intervals table with actual status of each interval.
with intervals as (
select e1.timestamp time1, e2.timestamp time2, e1.EVENT_TYPE as status
from machine_lifecycle_events e1
left join machine_lifecycle_events e2 on e2.id = e1.id + 1
) select * from counters c
join intervals i on (timestamp between i.time1 and i.time2 or i.time2 is null)
and i.status = 'ACTIVE';
I didn't use aggregation to show the result set, you can do this simply, I think. Also I missed machineId to simplify demonstration of this pattern.
Upvotes: 0
Reputation: 15089
ok, I have an answer, but I had to assume that the id of the machine_lifecycle_events
can be used to determine accessor and predecessor. So for my solution to work better you should have a link between the active and inactive events. There might be also other ways to solve it but those would add even more complexity.
first, to get the revenue for all active periods per machine you can do the following:
select c.id_of_machine, cycle_id, cycle_start, cycle_end, sum(total_revenue)
from counters c join (
select e1.id as cycle_id,
e1.timestamp as cycle_start,
e2.timestamp as cycle_end,
e1.id_of_affected_machine as cycle_machine_id
from machine_lifecycle_events e1 join machine_lifecycle_events e2
on e1.id + 1 = e2.id and -- this should be replaced with a specific column to find cycles which belong together
e1.id_of_affected_machine = e2.id_of_affected_machine
where e1.event_type = 'ACTIVE'
) cycle
on c.id_of_machine = cycle_machine_id and
cycle_start <= c.timestamp and c.timestamp <= cycle_end
group by c.id_of_machine, cycle_id, cycle_start, cycle_end
order by c.id_of_machine, cycle_id
you can further use this query and add more where conditions to get the revenue only within a time frame or for specific machines:
select sum(total_revenue)
from counters c join (
select e1.id as cycle_id,
e1.timestamp as cycle_start,
e2.timestamp as cycle_end,
e1.id_of_affected_machine as cycle_machine_id
from machine_lifecycle_events e1 join machine_lifecycle_events e2
on e1.id + 1 = e2.id and -- this should be replaced with a specific column to find cycles which belong together
e1.id_of_affected_machine = e2.id_of_affected_machine
where e1.event_type = 'ACTIVE'
) cycle
on c.id_of_machine = cycle_machine_id and
cycle_start <= c.timestamp and c.timestamp <= cycle_end
where '2014-03-02 08:00:00' <= c.timestamp and c.timestamp <= '2014-03-06 08:00:00'
and c.id_of_machine = '1'
As mentioned in the beginning, and in the comments, my way of finding connecting events isn't suitable for any more complex examples with multiple machines. The easiest way would be to have another column which would always point to the preceding event. Another way would be to have a function which would find those events but this solution couldn't make use of indices.
Upvotes: 0