Reputation: 9031
Core requirement:
Find latest entries for a person_id
by submission_date
for specified filter criteria type
, plan
, status
. There could be more such filters, but the logic to return latest by submission date is the same regardless. Two major uses one for paginated viewing in UI and second for generating reports.
WITH cte AS (
SELECT * FROM (
SELECT my_table.*, rank() OVER (PARTITION BY person_id ORDER BY submission_date DESC, last_updated DESC, id DESC) FROM my_table
) rank_filter
WHERE RANK=1 AND status in ('ACCEPTED','CORRECTED') AND type != 'CR' AND h_plan_id IN (10000, 20000)
)
SELECT
SELECT count(id) FROM cte group by id,
SELECT * FROM cte limit 10 offset 0;
The group by
also does not work on a CTE. a union with all null
in the count query might work for combining probably, but not sure.
The main reason I want to combine these two into 1 query is because the table is big and the window function is expensive. Currently I use separate queries which both basically run the same query twice.
Postgres version 12.
\d my_table;
Table "public.my_table"
Column | Type | Collation | Nullable
--------------------------+-----------------------------+-----------+----------
id | bigint | | not null
h_plan_id | bigint | | not null
h_plan_submitter_id | bigint | |
last_updated | timestamp without time zone | |
date_created | timestamp without time zone | |
modified_by | character varying(255) | |
segment_number | integer | |
-- <bunch of other text columns>
submission_date | character varying(255) | |
person_id | character varying(255) | |
status | character varying(255) | |
file_id | bigint | | not null
Indexes:
"my_table_pkey" PRIMARY KEY, btree (id)
"my_table_file_idx" btree (file_id)
"my_table_hplansubmitter_idx" btree (h_plan_submitter_id)
"my_table_key_hash_idx" btree (key_hash)
"my_table_person_id_idx" btree (person_id)
"my_table_segment_number_idx" btree (segment_number)
Foreign-key constraints:
"fk38njesaryvhj7e3p4thqkq7pb" FOREIGN KEY (h_plan_id) REFERENCES health_plan(id) ON UPDATE CASCADE ON DELETE CASCADE
"fk6by9668sowmdob7433mi3rpsu" FOREIGN KEY (h_plan_submitter_id) REFERENCES h_plan_submitter(id) ON UPDATE CASCADE ON DELETE CASCADE
"fkb06gpo9ng6eujkhnes0eco7bj" FOREIGN KEY (file_id) REFERENCES x12file(id) ON UPDATE CASCADE ON DELETE CASCADE
Additional information
Possible values for type
are EN
and CR
with EN
being about 70% of the data.
Table column widths select avg_width from pg_stats where tablename='mytable';
is a total of 374 for 41 columns so about 9 per col.
The idea is to show some pages upfront to user, they can then filter by additional parameters like file_name
(each file usually has about 5k entries), type
(very low cardinality), member_add_id
(high cardinality), plan_id
(low cardinality, every 500k to a million entries will be associated to a plan id). The business requirement in all cases is to show just the latest record for a certain set of plan id's for a submission_date
(for reports it is done per year). The order by id was just defensive coding, the same day can have multiple entries and even if someone edited the second last entry hence touching the last_updated
timestamp, we want to show only the very last entry of the same data. This probably never happens and can be removed.
User can use this data to generate csv reports.
Result of explain for the query with right join below:
Nested Loop Left Join (cost=554076.32..554076.56 rows=10 width=17092) (actual time=4530.914..4530.922 rows=10 loops=1)
CTE cte
-> Unique (cost=519813.11..522319.10 rows=495358 width=1922) (actual time=2719.093..3523.029 rows=422638 loops=1)
-> Sort (cost=519813.11..521066.10 rows=501198 width=1922) (actual time=2719.091..3301.622 rows=423211 loops=1)
Sort Key: mytable.person_id, mytable.submission_date DESC NULLS LAST, mytable.last_updated DESC NULLS LAST, mytable.id DESC
Sort Method: external merge Disk: 152384kB
-> Seq Scan on mytable (cost=0.00..54367.63 rows=501198 width=1922) (actual time=293.953..468.554 rows=423211 loops=1)
Filter: (((status)::text = ANY ('{ACCEPTED,CORRECTED}'::text[])) AND (h_plan_id = ANY ('{1,2}'::bigint[])) AND ((type)::text <> 'CR'::text))
Rows Removed by Filter: 10158
-> Aggregate (cost=11145.56..11145.57 rows=1 width=8) (actual time=4142.116..4142.116 rows=1 loops=1)
-> CTE Scan on cte (cost=0.00..9907.16 rows=495358 width=0) (actual time=2719.095..4071.481 rows=422638 loops=1)
-> Limit (cost=20611.67..20611.69 rows=10 width=17084) (actual time=388.777..388.781 rows=10 loops=1)
-> Sort (cost=20611.67..21850.06 rows=495358 width=17084) (actual time=388.776..388.777 rows=10 loops=1)
Sort Key: cte_1.person_id
Sort Method: top-N heapsort Memory: 30kB
-> CTE Scan on cte cte_1 (cost=0.00..9907.16 rows=495358 width=17084) (actual time=0.013..128.314 rows=422638 loops=1)
Planning Time: 0.369 ms
JIT:
Functions: 9
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.947 ms, Inlining 4.983 ms, Optimization 178.469 ms, Emission 110.261 ms, Total 295.660 ms
Execution Time: 4587.711 ms
Upvotes: 5
Views: 10141
Reputation: 656351
First things first: you can use results from a CTE multiple times in the same query, that's a main feature of CTEs.) What you have would work like this (while still using the CTE once only):
WITH cte AS (
SELECT * FROM (
SELECT *, row_number() -- see below
OVER (PARTITION BY person_id
ORDER BY submission_date DESC NULLS LAST -- see below
, last_updated DESC NULLS LAST -- see below
, id DESC) AS rn
FROM tbl
) sub
WHERE rn = 1
AND status IN ('ACCEPTED', 'CORRECTED')
)
SELECT *, count(*) OVER () AS total_rows_in_cte
FROM cte
LIMIT 10
OFFSET 0; -- see below
rank()
rank()
can return multiple rows per person_id
with rank = 1
. DISTINCT ON (person_id)
(like Gordon provided) is an applicable replacement for row_number()
- which works for you, as additional info clarified. See:
ORDER BY submission_date DESC
Neither submission_date
nor last_updated
are defined NOT NULL
. Can be an issue with ORDER BY submission_date DESC, last_updated DESC ...
See:
Should those columns really be NOT NULL
?
You replied:
Yes, all those columns should be non-null. I can add that constraint. I put it as nullable since we get data in files which are not always perfect. But this is very rare condition and I can put in empty string instead.
Empty strings are not allowed for type date
. Keep the columns nullable. NULL
is the proper value for those cases. Use NULLS LAST
as demonstrated to avoid NULL
being sorted on top.
OFFSET
If OFFSET
is equal or greater than the number of rows returned by the CTE, you get no row, so also no total count. See:
Addressing all caveats so far, and based on added information, we might arrive at this query:
WITH cte AS (
SELECT DISTINCT ON (person_id) *
FROM tbl
WHERE status IN ('ACCEPTED', 'CORRECTED')
ORDER BY person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC
)
SELECT *
FROM (
TABLE cte
ORDER BY person_id -- ?? see below
LIMIT 10
OFFSET 0
) sub
RIGHT JOIN (SELECT count(*) FROM cte) c(total_rows_in_cte) ON true;
Now the CTE is actually used twice. The RIGHT JOIN
guarantees we get the total count, no matter the OFFSET
. DISTINCT ON
should perform OK-ish for the only few rows per (person_id)
in the base query.
But you have wide rows. How wide on average? The query will likely result in a sequential scan on the whole table. Indexes won't help (much). All of this will remain hugely inefficient for paging. See:
You cannot involve an index for paging as that is based on the derived table from the CTE. And your actual sort criteria for paging is still unclear (ORDER BY id
?). If paging is the goal, you desperately need a different query style. If you are only interested in the first few pages, you need a different query style, yet. The best solution depends on information still missing in the question ...
For your updated objective:
Find latest entries for a
person_id
bysubmission_date
(Ignoring "for specified filter criteria, type, plan, status" for simplicity.)
And:
Find the latest row per
person_id
only if that hasstatus IN ('ACCEPTED','CORRECTED')
Based on these two specialized indices:
CREATE INDEX ON tbl (submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST)
WHERE status IN ('ACCEPTED', 'CORRECTED'); -- optional
CREATE INDEX ON tbl (person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST);
Run this query:
WITH RECURSIVE cte AS (
(
SELECT t -- whole row
FROM tbl t
WHERE status IN ('ACCEPTED', 'CORRECTED')
AND NOT EXISTS (SELECT FROM tbl
WHERE person_id = t.person_id
AND ( submission_date, last_updated, id)
> (t.submission_date, t.last_updated, t.id) -- row-wise comparison
)
ORDER BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT (SELECT t1 -- whole row
FROM tbl t1
WHERE ( t1.submission_date, t1.last_updated, t1.id)
< ((t).submission_date,(t).last_updated,(t).id) -- row-wise comparison
AND t1.status IN ('ACCEPTED', 'CORRECTED')
AND NOT EXISTS (SELECT FROM tbl
WHERE person_id = t1.person_id
AND ( submission_date, last_updated, id)
> (t1.submission_date, t1.last_updated, t1.id) -- row-wise comparison
)
ORDER BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
LIMIT 1)
FROM cte c
WHERE (t).id IS NOT NULL
)
SELECT (t).*
FROM cte
LIMIT 10
OFFSET 0;
Every set of parentheses here is required.
This level of sophistication should retrieve a relatively small set of top rows radically faster by using the given indices and no sequential scan. See:
submission_date
should most probably be type timestamptz
or date
, not - which is an odd type definition in Postgres in any case. See:character varying(255)
Many more details might be optimized, but this is getting out of hands. You might consider professional consulting.
Upvotes: 7
Reputation: 1269513
You might try simplifying the query:
SELECT DISTINCT ON (person_id) t.*
FROM my_table t
WHERE status in ('ACCEPTED', 'CORRECTED')
ORDER BY person_id, submission_date DESC, last_updated DESC, id DESC
I am not sure if Postgres is smart enough to use an index on (person_id, submission_date DESC, last_updated DESC, id DESC, status)
in this case, but it is worth a try.
You could speed this using an index on an expression: (status in ('ACCEPTED', 'CORRECTED'), person_id, submission_date DESC, last_updated DESC, id DESC)
.
EDIT:
If you want to sort by another column, you can use a subquery:
SELECT t.*
FROM (SELECT DISTINCT ON (person_id) t.*
FROM my_table t
WHERE status in ('ACCEPTED', 'CORRECTED')
ORDER BY person_id, submission_date DESC, last_updated DESC, id DESC
) t
ORDER BY submission_date DESC
Upvotes: 3