return1.at
return1.at

Reputation: 3140

find every n-th date in a continuous date stream

i would like to find/mark every 4th day in a continuous date stream inserted into my table for each user in a given date range

CREATE TABLE mytable (
  id INTEGER,
  myuser INTEGER,            
  day DATE NOT NULL,
  PRIMARY KEY  (id)      
);

the problem is, that only 3 continuous days are valid per user, after that, there has to be a one day "break"

 id  | myuser |    day     |
-----+--------+------------+
  0  |    200 | 2012-01-12 | }
  1  |    200 | 2012-01-13 | }--> 3 continuous days
  2  |    200 | 2012-01-14 | }
  3  |    200 | 2012-01-15 | <-- not ok, user 200 should get warned and delete this
  4  |    200 | 2012-01-16 | }
  5  |    200 | 2012-01-17 | }--> 3 continuous days
  6  |    200 | 2012-01-18 | }
  7  |    200 | 2012-01-19 | <-- not ok, user 200 should get warned and delete this
  8  |    201 | 2012-01-12 | }
  9  |    201 | 2012-01-13 | }--> 3 continuous days
  10 |    201 | 2012-01-14 | }
  11 |    201 | 2012-01-16 | <-- ok, there is a one day gap here
  12 |    201 | 2012-01-17 | 

the main goal is to look at a given date range (usually a month) and identify days, which are not allowed. Also i have to take care that the overlapping dates are handled correctly, for example, if i look on a date range from 2012-02-01 to 2012-02-29, 2012-02-01 could be a "break" day if 2012-01-29 to 2012-01-31 is present in that table for the same user.

Upvotes: 4

Views: 312

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 657972

Much simpler and faster with the window function lag():

SELECT myuser
      ,day
      ,COALESCE(lag(day, 3) OVER (PARTITION BY myuser ORDER BY day) = (day - 3)
                ,FALSE) AS break_overdue
FROM   mytable
WHERE  day BETWEEN ('2012-01-12'::date - 3) AND '2012-01-16'::date;

Result:

 myuser |    day     | break_overdue
--------+------------+---------------
    200 | 2012-01-12 | f
    200 | 2012-01-13 | f
    200 | 2012-01-14 | f
    200 | 2012-01-15 | t
    200 | 2012-01-16 | t
    201 | 2012-01-12 | f
    201 | 2012-01-13 | f
    201 | 2012-01-14 | f
    201 | 2012-01-16 | f

Major points:

  • The query marks all days as break_overdue after three consecutive days. It is unclear whether you want all of them marked after the rule has been broken or just every 4th day.

  • I include 3 days before the start date (not just two) to determine whether the first day is already in violation of the rule.

  • The test is simple: if the 3rd row before the current row within the partition equals the current day - 3 then the rule has been broken. I wrap it all in COALESCE to fold NULL values to FALSE for cosmetic reasons only. Guaranteed to work as long as (myuser, day) is unique.
    In PostgreSQL you can subtract integers form a date, effectively subtracting days.

  • Can be done in a single query level, no CTE or subquery needed. Should be much faster.

  • You need PostgreSQL 8.4 or later for window functions.

Upvotes: 2

MatBailie
MatBailie

Reputation: 86765

I don't have access to PostgreSQL, but hopefully this works...

WITH
  grouped_data AS
(
  SELECT
    ROW_NUMBER() OVER (PARTITION BY myuser ORDER BY day) - (day - start_date) AS user_group_id,
    myuser,
    day
  FROM
    myTable
  WHERE
        day >= start_date - 3
    AND day <= end_date
)
,
  sequenced_data AS
(
  SELECT
    ROW_NUMBER() OVER (PARTITION BY myuser, user_group_id ORDER BY day) AS sequence_id,
    myuser,
    day
  FROM
    grouped_data
)
SELECT
  myuser,
  day,
  CASE WHEN sequence_id % 4 = 0 THEN 1 ELSE 0 END as should_be_a_break_day
FROM
  sequenced_data
WHERE
  day >= start_date


Sorry I didn't explain the workings, I had to jump into a meeting :)

Example with start_date = '2012-01-14'...

 id | myuser |    day     | ROW_NUMBER() | day - start_date | user_group_id
----+--------+------------+--------------+------------------+---------------
  0 |    200 | 2012-01-12 |     1        |        -2        |  1 - -2 = 3
  1 |    200 | 2012-01-13 |     2        |        -1        |  2 - -1 = 3
  2 |    200 | 2012-01-14 |     3        |         0        |  3 -  0 = 3
  3 |    200 | 2012-01-15 |     4        |         1        |  4 -  1 = 3
  4 |    200 | 2012-01-16 |     5        |         2        |  5 -  2 = 3
----+--------+------------+--------------+------------------+---------------
  5 |    201 | 2012-01-12 |     1        |        -2        |  1 - -2 = 3
  6 |    201 | 2012-01-13 |     2        |        -1        |  2 - -1 = 3
  7 |    201 | 2012-01-14 |     3        |         0        |  3 - -1 = 3
  8 |    201 | 2012-01-16 |     4        |         2        |  4 -  2 = 2

Any sequential dates will have the same user_group_id. Each 'gap' in the days makes that user_group_id decrease by 1 (see row 8, if the record was for the 17th, a 2 day gap, the id would have been 1).

Once you have a group_id, row_number() can be easily used to say which day in the sequence it is. A max of 3 day is the same as "Every 4th day should be a gap", and "x % 4 = 0" identifies every 4th day.

Upvotes: 5

Related Questions