sqlpostgresqlaggregate-functionsdate-arithmeticwindow-functions

Reputation: 2558

Summarize values across timeline in SQL

The Problem

I have a PostgreSQL database on which I am trying to summarize the revenue of a cash register over time. The cash register can either have status ACTIVE or INACTIVE, but I only want to summarize the earnings created when it was ACTIVE for a given period of time.

I have two tables; one that marks the revenue and one that marks the cash register status:

CREATE TABLE counters
(
  id bigserial NOT NULL,
  "timestamp" timestamp with time zone,
  total_revenue bigint,
  id_of_machine character varying(50),
  CONSTRAINT counters_pkey PRIMARY KEY (id)
)

CREATE TABLE machine_lifecycle_events
(
  id bigserial NOT NULL,
  event_type character varying(50),
  "timestamp" timestamp with time zone,
  id_of_affected_machine character varying(50),
  CONSTRAINT machine_lifecycle_events_pkey PRIMARY KEY (id)
)

A counters entry is added every 1 minute and total_revenue only increases. A machine_lifecycle_events entry is added every time the status of the machine changes.

I have added an image illustrating the problem. It is the revenue during the blue periods which should be summarized.

Timeline showing problem.

What I have tried so far

I have created a query which can give me the total revenue in a given instant:

SELECT total_revenue 
  FROM counters 
 WHERE timestamp < '2014-03-05 11:00:00' 
       AND id_of_machine='1' 
ORDER BY 
       timestamp desc 
 LIMIT 1

The questions

How do I calculate the revenue earned between two timestamps?
How do I determine the start and end timestamps of the blue periods when I have to compare the timestamps in machine_lifecycle_events with the input period?

Any ideas on how to attack this problem?

Update

Example data:

INSERT INTO counters VALUES
   (1,  '2014-03-01 00:00:00', 100,  '1')
 , (2,  '2014-03-01 12:00:00', 200,  '1')
 , (3,  '2014-03-02 00:00:00', 300,  '1')
 , (4,  '2014-03-02 12:00:00', 400,  '1')
 , (5,  '2014-03-03 00:00:00', 500,  '1')
 , (6,  '2014-03-03 12:00:00', 600,  '1')
 , (7,  '2014-03-04 00:00:00', 700,  '1')
 , (8,  '2014-03-04 12:00:00', 800,  '1')
 , (9,  '2014-03-05 00:00:00', 900,  '1')
 , (10, '2014-03-05 12:00:00', 1000, '1')
 , (11, '2014-03-06 00:00:00', 1100, '1')
 , (12, '2014-03-06 12:00:00', 1200, '1')
 , (13, '2014-03-07 00:00:00', 1300, '1')
 , (14, '2014-03-07 12:00:00', 1400, '1');

INSERT INTO machine_lifecycle_events VALUES
   (1, 'ACTIVE',   '2014-03-01 08:00:00', '1')
 , (2, 'INACTIVE', '2014-03-03 00:00:00', '1')
 , (3, 'ACTIVE',   '2014-03-05 00:00:00', '1')
 , (4, 'INACTIVE', '2014-03-06 12:00:00', '1');

SQL Fiddle with sample data.

Example query:
The revenue between '2014-03-02 08:00:00' and '2014-03-06 08:00:00' is 300. 100 for the first ACTIVE period, and 200 for the second ACTIVE period.

Upvotes: 7

Answers (3)

Erwin Brandstetter

Reputation: 656471

DB design

To make my work easier I sanitized your DB design before I tackled the questions:

CREATE TEMP TABLE counter (
    id            bigserial PRIMARY KEY
  , ts            timestamp NOT NULL
  , total_revenue bigint NOT NULL
  , machine_id    int NOT NULL
);

CREATE TEMP TABLE machine_event (
    id            bigserial PRIMARY KEY
  , ts            timestamp NOT NULL
  , machine_id    int NOT NULL
  , status_active bool NOT NULL
);

Test case in the fiddle.

Major points

Using ts instead of "timestamp". Never use basic type names as column names.
Simplified & unified the name machine_id and made it out to be integer as it should be, instead of varchar(50).
event_type varchar(50) should be an integer foreign key, too, or an enum. Or even just a boolean for only active / inactive. Simplified to status_active bool.
Simplified and sanitized INSERT statements as well.

Answers

Assumptions

total_revenue only increases (per question).
Borders of the outer time frame are included.
Every "next" row per machine in machine_event has the opposite status_active.

1. How do I calculate the revenue earned between two timestamps?

WITH span AS (
   SELECT '2014-03-02 12:00'::timestamp AS s_from  -- start of time range
        , '2014-03-05 11:00'::timestamp AS s_to    -- end of time range
   )
SELECT machine_id, s.s_from, s.s_to
     , max(total_revenue) - min(total_revenue) AS earned
FROM   counter c
     , span s
WHERE  ts BETWEEN s_from AND s_to                  -- borders included!
AND    machine_id =  1
GROUP  BY 1,2,3;

2. How do I determine the start and end timestamps of the blue periods when I have to compare the timestamps in machine_event with the input period?

This query for all machines in the given time frame (span).
Add WHERE machine_id = 1 in the CTE cte to select a specific machine.

WITH span AS (
   SELECT '2014-03-02 08:00'::timestamp AS s_from  -- start of time range
        , '2014-03-06 08:00'::timestamp AS s_to    -- end of time range
   )
, cte AS (
   SELECT machine_id, ts, status_active, s_from
        , lead(ts, 1, s_to) OVER w AS period_end
        , first_value(ts)   OVER w AS first_ts
   FROM   span          s
   JOIN   machine_event e ON e.ts BETWEEN s.s_from AND s.s_to
   WINDOW w AS (PARTITION BY machine_id ORDER BY ts)
   )
SELECT machine_id, ts AS period_start, period_end -- start in time frame
FROM   cte
WHERE  status_active

UNION  ALL                             -- active start before time frame
SELECT machine_id, s_from, ts
FROM   cte
WHERE  NOT status_active
AND    ts =  first_ts
AND    ts <> s_from

UNION  ALL       -- active start before time frame, no end in time frame
SELECT machine_id, s_from, s_to
FROM  (
   SELECT DISTINCT ON (1)
          e.machine_id, e.status_active, s.s_from, s.s_to
   FROM   span          s
   JOIN   machine_event e ON e.ts < s.s_from  -- only from before time range
   LEFT   JOIN cte c USING (machine_id)
   WHERE  c.machine_id IS NULL                -- not in selected time range
   ORDER  BY e.machine_id, e.ts DESC          -- only the latest entry
   ) sub
WHERE  status_active -- only if active
ORDER  BY 1, 2;

Result is the list of blue periods in your image.
SQL Fiddle demonstrating both.

Recent similar question:
Sum of time difference between rows

Upvotes: 2

zealot

Reputation: 247

Use self-join and build intervals table with actual status of each interval.

with intervals as (
    select e1.timestamp time1, e2.timestamp time2, e1.EVENT_TYPE as status
    from machine_lifecycle_events e1
    left join machine_lifecycle_events e2 on e2.id = e1.id + 1
) select * from counters c
join intervals i on (timestamp between i.time1 and i.time2 or i.time2 is null) 
    and i.status = 'ACTIVE';

I didn't use aggregation to show the result set, you can do this simply, I think. Also I missed machineId to simplify demonstration of this pattern.

Upvotes: 0

peter

Reputation: 15089

ok, I have an answer, but I had to assume that the id of the machine_lifecycle_events can be used to determine accessor and predecessor. So for my solution to work better you should have a link between the active and inactive events. There might be also other ways to solve it but those would add even more complexity.

first, to get the revenue for all active periods per machine you can do the following:

select c.id_of_machine, cycle_id, cycle_start, cycle_end, sum(total_revenue)
from counters c join (
    select e1.id as cycle_id, 
           e1.timestamp as cycle_start, 
           e2.timestamp as cycle_end,
           e1.id_of_affected_machine as cycle_machine_id
    from machine_lifecycle_events e1 join machine_lifecycle_events e2 
        on e1.id + 1 = e2.id and -- this should be replaced with a specific column to find cycles which belong together
           e1.id_of_affected_machine = e2.id_of_affected_machine
    where e1.event_type = 'ACTIVE'
        ) cycle
    on c.id_of_machine = cycle_machine_id and 
       cycle_start <= c.timestamp and c.timestamp <= cycle_end
group by c.id_of_machine, cycle_id, cycle_start, cycle_end
order by c.id_of_machine, cycle_id

you can further use this query and add more where conditions to get the revenue only within a time frame or for specific machines:

select sum(total_revenue)
from counters c join (
    select e1.id as cycle_id, 
           e1.timestamp as cycle_start, 
           e2.timestamp as cycle_end,
           e1.id_of_affected_machine as cycle_machine_id
    from machine_lifecycle_events e1 join machine_lifecycle_events e2 
        on e1.id + 1 = e2.id and -- this should be replaced with a specific column to find cycles which belong together
           e1.id_of_affected_machine = e2.id_of_affected_machine
    where e1.event_type = 'ACTIVE'
        ) cycle
    on c.id_of_machine = cycle_machine_id and 
   cycle_start <= c.timestamp and c.timestamp <= cycle_end
where '2014-03-02 08:00:00' <= c.timestamp and c.timestamp <= '2014-03-06 08:00:00'
    and c.id_of_machine = '1'

As mentioned in the beginning, and in the comments, my way of finding connecting events isn't suitable for any more complex examples with multiple machines. The easiest way would be to have another column which would always point to the preceding event. Another way would be to have a function which would find those events but this solution couldn't make use of indices.