sqlpostgresqlgreatest-n-per-grouppostgresql-8.3

Reputation: 4346

How to select more than 1 record per day?

This is a postgresql problem.

PostgreSQL 8.3.3 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-9).

The table looks like:

date_time           other_column
2012-11-01 00:00:00 ...
2012-11-02 01:00:00 ...
2012-11-02 02:00:00 ...
2012-11-02 03:00:00 ...
2012-11-02 04:00:00 ...
2012-11-03 05:00:00 ...
2012-11-03 06:00:00 ...
2012-11-05 00:00:00 ...
2012-11-07 00:00:00 ...
2012-11-07 00:00:00 ...
...

I want to select at most 3 records per day from a specific date range.

For example, I want to select at most 3 records from 2012-11-02 to 2012-11-05. The expected result would be:

date_time           other_column
2012-11-02 01:00:00 ...
2012-11-02 02:00:00 ...
2012-11-02 03:00:00 ...
2012-11-03 05:00:00 ...
2012-11-03 06:00:00 ...
2012-11-05 00:00:00 ...

I have spent a few hours on this and still cannot figure it out. Please help me. :(

UPDATE: The current sql I tried could only select one record per day:

SELECT DISTINCT ON (TO_DATE(SUBSTRING((date_time || '') FROM 1 FOR 10), 'YYYY-MM-DD')) *
FROM myTable
WHERE  date_time >=  '20121101 00:00:00'  
AND  date_time <= '20121130 23:59:59'

Upvotes: 3

Answers (3)

Erwin Brandstetter

Reputation: 656401

I want to select at most 3 records per day from a specific date range.

SELECT date_time, other_column
FROM  (
   SELECT *, row_number() OVER (PARTITION BY date_time::date) AS rn
   FROM   tbl
   WHERE  date_time >= '2012-11-01 0:0'
   AND    date_time <  '2012-12-01 0:0'
   ) x
WHERE  rn < 4;

Major points

Use the window function row_number(). rank() or dense_rank() would be wrong according to the question - more than 3 records might be selected with timestamp duplicates.
Since you do not define which rows you want per day, the correct answer is not to include an ORDER BY clause in the window function. Gives you an arbitrary selection, which matches the question.
I changed your WHERE clause from
```
WHERE  date_time >= '20121101 00:00:00'  
AND    date_time <= '20121130 23:59:59'
```
to
```
WHERE  date_time >=  '2012-11-01 0:0'  
AND    date_time <   '2012-12-01 0:0'
```
Your syntax would fail for corner cases like '20121130 23:59:59.123'.

What @Craig suggested:
```
date_time::date BETWEEN '2012-11-02' AND '2012-11-05'
```
.. would work correctly, but is an anti-pattern regarding performance. If you apply a cast or a function to your database column in the expression, plain indexes cannot be used.

Solution for PostgreSQL 8.3

Best solution: Upgrade to a more recent version, preferably to the current version 9.2.

Other solutions:

For only few days you could employ UNION ALL:

SELECT date_time, other_column
FROM   tbl t1
WHERE  date_time >= '2012-11-01 0:0'
AND    date_time <  '2012-11-02 0:0'
LIMIT  3
)
UNION ALL 
(
SELECT date_time, other_column
FROM   tbl t1
WHERE  date_time >= '2012-11-02 0:0'
AND    date_time <  '2012-11-03 0:0'
LIMIT  3
)
...

Parenthesis are not optional here.

For more days there are workarounds with generate_series() - something like I posted here (including a link to more).

I might have solved it with a plpgsql function back in the old days before we had window functions:

CREATE OR REPLACE FUNCTION x.f_foo (date, date, integer
                         , OUT date_time timestamp, OUT other_column text)
  RETURNS SETOF record AS
$BODY$
DECLARE
    _last_day date;          -- remember last day
    _ct       integer := 1;  -- count
BEGIN

FOR date_time, other_column IN
   SELECT t.date_time, t.other_column
   FROM   tbl t
   WHERE  t.date_time >= $1::timestamp
   AND    t.date_time <  ($2 + 1)::timestamp
   ORDER  BY t.date_time::date
LOOP
   IF date_time::date = _last_day THEN
      _ct := _ct + 1;
   ELSE
      _ct := 1;
   END IF;

   IF _ct <= $3 THEN
      RETURN NEXT;
   END IF;

   _last_day := date_time::date;
END LOOP;

END;
$BODY$ LANGUAGE plpgsql STABLE STRICT;

COMMENT ON FUNCTION f_foo(date3, date, integer) IS 'Return n rows per day
$1 .. date_from (incl.)
$2 .. date_to  (incl.)
$3 .. maximim rows per day';

Call:

SELECT * FROM f_foo('2012-11-01', '2012-11-05', 3);

Upvotes: 3

Craig Ringer

Reputation: 324375

The following answers all use date_trunc('day',date_time) or just cast to date to truncate a timestamp to a date. There's no need to jump through hoops with date formatting and strings. See Date/time functions in the manual.

This SQLFiddle shows three possible answers: http://sqlfiddle.com/#!12/0fd51/14, all of which produce the same result for the input data (but not necessarily the same result if date_time can have duplicates in it).

To solve your problem you could use a correlated subquery with a limit to generate an IN-list to filter on:

SELECT a.date_time, a.other_column
FROM table1 a
WHERE a.date_time IN (
  SELECT b.date_time
  FROM table1 b
  WHERE b.date_time IS NOT NULL
    AND a.date_time::date = b.date_time::date
  ORDER BY b.date_time
  LIMIT 3
)
AND a.date_time::date BETWEEN '2012-11-02' AND '2012-11-05';

This should be the most portable approach - though it won't work with MySQL (at least as of 5.5) because MySQL doesn't support LIMIT in a subquery used in an IN clause. It works in SQLite3 and PostgreSQL, though, and should work in most other DBs.

Another option would be to select the range of dates you wanted, annotate the rows within the range with a row number using a window function, then filter the output to exclude excess rows:

SELECT date_time, other_column
FROM (
  SELECT 
    date_time, 
    other_column, 
    rank() OVER (PARTITION BY date_trunc('day',date_time) ORDER BY date_time) AS n
  FROM Table1
  WHERE date_trunc('day',date_time) BETWEEN '2012-11-02' AND '2012-11-05'
  ORDER BY date_time
) numbered_rows
WHERE n < 4;

If ties are a possibility, ie if date_time is not unique, then consider using either the rank or dense_rank window functions instead of row_number to get deterministic results, or add an additional clause to the ORDER BY in row_number to break the tie.

If you use rank then it'll include none of the rows if it can't fit all of them in; if you use dense_rank it'll include all of them even if it has to go over the 3-row-per-day limit to do so.

All sorts of other processing are possible this way too, using the window specification.

Here's yet another formulation that uses array aggregation and slicing, which is completely PostgreSQL specific but fun.

SELECT b.date_time, b.other_column 
FROM (
  SELECT array_agg(a.date_time ORDER BY a.date_time)
  FROM table1 a
  WHERE a.date_time::date BETWEEN '2012-11-02' 
    AND '2012-11-05'
  GROUP BY a.date_time::date
) x(arr) 
INNER JOIN table1 b ON (b.date_time = ANY (arr[1:3]));

Upvotes: 3

Tiit

Reputation: 520

I would use a sub-select and a left outer join. This should do the trick:

select distinct(date_format(a.date_time,"%Y-%m-%d")) date_time, b.* from table a
left outer join (
  select date_format(date_time,"%Y-%m-%d") dt, * from table limit 3
) b 
on date_format(a.date_time,"%Y-%m-%d") = b.dt;

Upvotes: -3

How to select more than 1 record per day?

Answers (3)

Major points

Solution for PostgreSQL 8.3

Related Questions