Dizz
Dizz

Reputation: 43

Find first 3 orders for each customer

I am trying to find the first 3 timestamps for each customer.

Sample data for table customer_orders:

customer_id timestamp
6778 '2022-01-01'
6778 '2022-02-05'
5544 '2022-04-01'
6778 '2022-02-04'
5544 '2022-04-03'
5544 '2022-04-02'
5544 '2022-01-01'
6778 '2021-01-01'

Desired outcome:

customer_id timestamp
5544 '2022-01-01'
5544 '2022-04-01'
5544 '2022-04-02'
6778 '2021-01-01'
6778 '2022-01-01'
6778 '2022-04-02'

My query so far:

SELECT
    customer_id, 
    timestamp
FROM customer_orders
GROUP BY customer_id, timestamp
ORDER BY timestamp ASC
LIMIT 3

LIMIT 3 limits to 3 rows overall. But I want 3 rows per customer.

Upvotes: 1

Views: 1730

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656872

Faster window function

@Jim provided a valid solution. But there are subtle performance details. RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default window frame, spelling it out is just noise. The manual:

The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

But row_number() operates on rows by definition, it's more efficient to use ROWS mode:

SELECT customer_id, timestamp
FROM  (
   SELECT row_number() OVER (PARTITION BY customer_id ORDER BY timestamp
                             ROWS UNBOUNDED PRECEDING) AS rn
        , customer_id, timestamp
   FROM   customer_orders
   ) sub
WHERE  rn <= 3 
ORDER  BY 1, 2;

An index like this is essential to sped it up:

CREATE UNIQUE INDEX ON customer_orders (customer_id, timestamp);

I ran extensive tests on Postgres 13 and 14 with ROWS vs. RANGE mode, and ROWS is consistently ~ 20% faster with an index-only scan. Quite a revelation for one of the most commonly used window functions!

fiddle

Optimized in Postgres 16

I eventually presented this case to the Postgres devs, and Postgres 16 has this optimization now built in! See:

Meaning, in Postgres 16 (going forward), a plain row_number() is just as fast. In earlier versions adding the frame clause ROWS UNBOUNDED PRECEDING makes it considerably faster.

Faster, yet

That said, if your table is big and there are many rows per customer, a different query style is much faster, yet - by orders of magnitude. We need the same index as above.

Ideally, you have a table customers with exactly one row per relevant customer_id. If you don't have it, create it. Then:

SELECT c.customer_id, o.timestamp
FROM   customers c
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   customer_orders o
   WHERE  o.customer_id = c.customer_id
   ORDER  BY o.timestamp
   LIMIT  3
   ) o
ORDER  BY 1, 2;

fiddle

Related:

Upvotes: 0

Jim Jones
Jim Jones

Reputation: 19623

You can use the window function ROW_NUMBER() to numerate the columns of a given PARTITION (customer_id in your case) within a CTE and in the outer query just filter the n records from this generated column:

WITH j AS (
  SELECT customer_id, timestamp,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp
                       RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS n
  FROM customer_orders
) 
SELECT customer_id, timestamp FROM j
WHERE n <= 3 
ORDER BY customer_id, timestamp

Demo: db<>fiddle

Upvotes: 1

Related Questions