postgresqlsql-order-byselectionrowswindow-functions

Reputation:

PostgreSQL: select nearest rows according to sort order

I have a table like this:

     a    |  user_id
----------+-------------
  0.1133  |  2312882332
  4.3293  |  7876123213
  3.1133  |  2312332332
  1.3293  |  7876543213
  0.0033  |  2312222332
  5.3293  |  5344343213
  3.2133  |  4122331112
  2.3293  |  9999942333

And I want to locate a particular row - 1.3293 | 7876543213 for example - and select the nearest 4 rows. 2 above, 2 below if possible.
Sort order is ORDER BY a ASC.

In this case I will get:

  0.0033  |  2312222332
  0.1133  |  2312882332
  2.3293  |  9999942333
  3.1133  |  2312332332

How can I achieve this using PostgreSQL? (BTW, I'm using PHP.)

P.S.: For the last or first row the nearest rows would be 4 above or 4 below.

Upvotes: 3

Answers (3)

Erwin Brandstetter

Reputation: 656824

Test case:

CREATE TEMP TABLE tbl(a float, user_id bigint);
INSERT INTO tbl VALUES
 (0.1133, 2312882332)
,(4.3293, 7876123213)
,(3.1133, 2312332332)
,(1.3293, 7876543213)
,(0.0033, 2312222332)
,(5.3293, 5344343213)
,(3.2133, 4122331112)
,(2.3293, 9999942333);

Query:

WITH x AS (
    SELECT a
          ,user_id
          ,row_number() OVER (ORDER BY a, user_id) AS rn
    FROM   tbl
    ), y AS (
    SELECT rn, LEAST(rn - 3, (SELECT max(rn) - 5 FROM x)) AS min_rn
    FROM   x
    WHERE  (a, user_id) = (1.3293, 7876543213)
    )
SELECT *
FROM   x, y
WHERE  x.rn  > y.min_rn
AND    x.rn <> y.rn
ORDER  BY x.a, x.user_id
LIMIT  4;

Returns result as depicted in the question. Assuming that (a, user_id) is unique.

It is not clear whether a is supposed to unique. That's why I sort by user_id additionally to break ties. That's also why I use the window function row_number(), an not rank() for this. row_number() is the correct tool in any case. We want 4 rows. rank() would give an undefined number of rows if there were peers in the sort order.

This always returns 4 rows as long as there are at least 5 rows in the table. Close to first / last row, the first / last 4 rows are returned. The two rows before / after in all other cases. The criteria row itself is excluded.

Improved performance

This is an improved version of what @Tim Landscheidt posted. Vote for his answer if you like the idea with the index. Don't bother with small tables. But will boost performance for big tables - provided you have a fitting index in place. Best choice would be a multicolumn index on (a, user_id).

WITH params(_a, _user_id) AS (SELECT 5.3293, 5344343213) -- enter params once
    ,x AS  (
    (
    SELECT a
          ,user_id
          ,row_number() OVER (ORDER BY a DESC, user_id DESC) AS rn
    FROM   tbl, params p
    WHERE  a < p._a
       OR  a = p._a AND user_id < p._user_id -- a is not defined unique
    ORDER  BY a DESC, user_id DESC
    LIMIT  5  -- 4 + 1: including central row
    )
    UNION ALL -- UNION right away, trim one query level
    (
    SELECT a
          ,user_id
          ,row_number() OVER (ORDER BY a ASC, user_id ASC) AS rn
    FROM   tbl, params p
    WHERE  a > p._a
       OR  a = p._a AND user_id > p._user_id
    ORDER  BY a ASC, user_id ASC
    LIMIT  5
    )
    )
    , y AS (
    SELECT a, user_id
    FROM   x, params p
    WHERE (a, user_id) <> (p._a, p._user_id) -- exclude central row
    ORDER  BY rn  -- no need to ORDER BY a
    LIMIT  4
    )
SELECT *
FROM   y
ORDER  BY a, user_id   -- ORDER result as requested

Major differences to @Tim's version:

According to the question (a, user_id) form the search criteria, not just a. That changes window frame, ORDER BY and WHERE clause in subtly different ways.
UNION right away, no need for an extra query level. You need parenthesis around the two UNION-queries to allow for individual ORDER BY.
Sort result as requested. Requires another query level (at hardly any cost).
As parameters are used in multiple places I centralized the input in a leading CTE.
For repeated use you can wrap this query almost 'as is' into an SQL or plpgsql function.

Upvotes: 6

wildplasser

Reputation: 44250

set search_path='tmp';

DROP TABLE lutser;
CREATE TABLE lutser
        ( val float
        , num bigint
        );
INSERT INTO lutser(val, num)
VALUES ( 0.1133  ,  2312882332  )
      ,( 4.3293  ,  7876123213  )
      ,( 3.1133  ,  2312332332  )
      ,( 1.3293  ,  7876543213  )
      ,( 0.0033  ,  2312222332  )
      ,( 5.3293  ,  5344343213  )
      ,( 3.2133  ,  4122331112  )
      ,( 2.3293  ,  9999942333  )
        ;

WITH ranked_lutsers AS (
        SELECT val, num
        ,rank() OVER (ORDER BY val) AS rnk
        FROM lutser
        )
SELECT that.val, that.num
        , (that.rnk-this.rnk) AS relrnk
FROM ranked_lutsers that
JOIN ranked_lutsers this ON (that.rnk BETWEEN this.rnk-2 AND this.rnk+2)
WHERE this.val = 1.3293
        ;

Results:

DROP TABLE
CREATE TABLE
INSERT 0 8
  val   |    num     | relrnk 
--------+------------+--------
 0.0033 | 2312222332 |     -2
 0.1133 | 2312882332 |     -1
 1.3293 | 7876543213 |      0
 2.3293 | 9999942333 |      1
 3.1133 | 2312332332 |      2
(5 rows)

As Erwin pointed out, the center row is not wanted in the output. Also, the row_number() should be used instead of rank().

WITH ranked_lutsers AS (
        SELECT val, num
        -- ,rank() OVER (ORDER BY val) AS rnk
        , row_number() OVER (ORDER BY val, num) AS rnk
        FROM lutser
) SELECT that.val, that.num
        , (that.rnk-this.rnk) AS relrnk
FROM ranked_lutsers that
JOIN ranked_lutsers this ON (that.rnk BETWEEN this.rnk-2 AND this.rnk+2 )
WHERE this.val = 1.3293
AND that.rnk <> this.rnk
        ;

Result2:

  val   |    num     | relrnk 
--------+------------+--------
 0.0033 | 2312222332 |     -2
 0.1133 | 2312882332 |     -1
 2.3293 | 9999942333 |      1
 3.1133 | 2312332332 |      2
(4 rows)

UPDATE2: to always select four, even if we are at the top or bottom of the list. This makes the query a bit uglier. (but not as ugly as Erwin's ;-)

WITH ranked_lutsers AS (
        SELECT val, num
        -- ,rank() OVER (ORDER BY val) AS rnk
        , row_number() OVER (ORDER BY val, num) AS rnk
        FROM lutser
) SELECT that.val, that.num
        , ABS(that.rnk-this.rnk) AS srtrnk
        , (that.rnk-this.rnk) AS relrnk
FROM ranked_lutsers that
JOIN ranked_lutsers this ON (that.rnk BETWEEN this.rnk-4 AND this.rnk+4 )
-- WHERE this.val = 1.3293
WHERE this.val = 0.1133
AND that.rnk <> this.rnk
ORDER BY srtrnk ASC
LIMIT 4
        ;

Output:

  val   |    num     | srtrnk | relrnk 
--------+------------+--------+--------
 0.0033 | 2312222332 |      1 |     -1
 1.3293 | 7876543213 |      1 |      1
 2.3293 | 9999942333 |      2 |      2
 3.1133 | 2312332332 |      3 |      3
(4 rows)

UPDATE: A version with a nested CTE (featuring outer join!!!). For conveniance, I added a primary key to the table, which sounds like a good idea anyway IMHO.

WITH distance AS (
        WITH ranked_lutsers AS (
        SELECT id
        , row_number() OVER (ORDER BY val, num) AS rnk
        FROM lutser
        ) SELECT l0.id AS one
        ,l1.id AS two
        , ABS(l1.rnk-l0.rnk) AS dist
        -- Warning: Cartesian product below
        FROM ranked_lutsers l0
        , ranked_lutsers l1 WHERE l0.id <> l1.id

        )
SELECT lu.*
FROM lutser lu
JOIN distance di
ON lu.id = di.two
WHERE di.one= 1
ORDER by di.dist
LIMIT 4 
        ;

Upvotes: 0

Tim Landscheidt

Reputation: 1400

And another one:

WITH prec_rows AS
  (SELECT a,
          user_id,
          ROW_NUMBER() OVER (ORDER BY a DESC) AS rn
   FROM tbl
   WHERE a < 1.3293
   ORDER BY a DESC LIMIT 4),
     succ_rows AS
  (SELECT a,
          user_id,
          ROW_NUMBER() OVER (ORDER BY a ASC) AS rn
   FROM tbl
   WHERE a > 1.3293
   ORDER BY a ASC LIMIT 4)
SELECT a, user_id
FROM
  (SELECT a,
          user_id,
          rn
   FROM prec_rows
   UNION ALL SELECT a,
                    user_id,
                    rn
   FROM succ_rows) AS s
ORDER BY rn, a LIMIT 4;

AFAIR WITH will instantiate a memory table, so the focus of this solution is to limit its size as much as possible (in this case eight rows).

Upvotes: 2