Reputation: 43
I am trying to find the first 3 timestamps for each customer.
Sample data for table customer_orders
:
customer_id | timestamp |
---|---|
6778 | '2022-01-01' |
6778 | '2022-02-05' |
5544 | '2022-04-01' |
6778 | '2022-02-04' |
5544 | '2022-04-03' |
5544 | '2022-04-02' |
5544 | '2022-01-01' |
6778 | '2021-01-01' |
Desired outcome:
customer_id | timestamp |
---|---|
5544 | '2022-01-01' |
5544 | '2022-04-01' |
5544 | '2022-04-02' |
6778 | '2021-01-01' |
6778 | '2022-01-01' |
6778 | '2022-04-02' |
My query so far:
SELECT
customer_id,
timestamp
FROM customer_orders
GROUP BY customer_id, timestamp
ORDER BY timestamp ASC
LIMIT 3
LIMIT 3
limits to 3 rows overall. But I want 3 rows per customer.
Upvotes: 1
Views: 1730
Reputation: 656872
@Jim provided a valid solution. But there are subtle performance details. RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
is the default window frame, spelling it out is just noise. The manual:
The default framing option is
RANGE UNBOUNDED PRECEDING
, which is the same asRANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
.
But row_number()
operates on rows by definition, it's more efficient to use ROWS
mode:
SELECT customer_id, timestamp
FROM (
SELECT row_number() OVER (PARTITION BY customer_id ORDER BY timestamp
ROWS UNBOUNDED PRECEDING) AS rn
, customer_id, timestamp
FROM customer_orders
) sub
WHERE rn <= 3
ORDER BY 1, 2;
An index like this is essential to sped it up:
CREATE UNIQUE INDEX ON customer_orders (customer_id, timestamp);
I ran extensive tests on Postgres 13 and 14 with ROWS
vs. RANGE
mode, and ROWS
is consistently ~ 20% faster with an index-only scan. Quite a revelation for one of the most commonly used window functions!
I eventually presented this case to the Postgres devs, and Postgres 16 has this optimization now built in! See:
Meaning, in Postgres 16 (going forward), a plain row_number()
is just as fast. In earlier versions adding the frame clause ROWS UNBOUNDED PRECEDING
makes it considerably faster.
That said, if your table is big and there are many rows per customer, a different query style is much faster, yet - by orders of magnitude. We need the same index as above.
Ideally, you have a table customers
with exactly one row per relevant customer_id
. If you don't have it, create it. Then:
SELECT c.customer_id, o.timestamp
FROM customers c
CROSS JOIN LATERAL (
SELECT timestamp
FROM customer_orders o
WHERE o.customer_id = c.customer_id
ORDER BY o.timestamp
LIMIT 3
) o
ORDER BY 1, 2;
Related:
Upvotes: 0
Reputation: 19623
You can use the window function ROW_NUMBER()
to numerate the columns of a given PARTITION
(customer_id
in your case) within a CTE
and in the outer query just filter the n records from this generated column:
WITH j AS (
SELECT customer_id, timestamp,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS n
FROM customer_orders
)
SELECT customer_id, timestamp FROM j
WHERE n <= 3
ORDER BY customer_id, timestamp
Demo: db<>fiddle
Upvotes: 1