Abe
Abe

Reputation: 9031

Get paginated rows and total count in single query

Core requirement:
Find latest entries for a person_id by submission_date for specified filter criteria type, plan, status. There could be more such filters, but the logic to return latest by submission date is the same regardless. Two major uses one for paginated viewing in UI and second for generating reports.

WITH cte AS (
  SELECT * FROM (
    SELECT my_table.*, rank() OVER (PARTITION BY person_id ORDER BY submission_date DESC, last_updated DESC, id DESC) FROM my_table
    )  rank_filter 
      WHERE RANK=1 AND status in ('ACCEPTED','CORRECTED') AND type != 'CR' AND h_plan_id IN (10000, 20000)
)
SELECT
SELECT count(id) FROM cte group by id,
SELECT * FROM cte limit 10 offset 0;

The group by also does not work on a CTE. a union with all null in the count query might work for combining probably, but not sure.

The main reason I want to combine these two into 1 query is because the table is big and the window function is expensive. Currently I use separate queries which both basically run the same query twice.

Postgres version 12.

\d my_table;
                               Table "public.my_table"
                 Column   |            Type             | Collation | Nullable 
--------------------------+-----------------------------+-----------+----------
 id                       | bigint                      |           | not null 
 h_plan_id                | bigint                      |           | not null 
 h_plan_submitter_id      | bigint                      |           |          
 last_updated             | timestamp without time zone |           |          
 date_created             | timestamp without time zone |           |          
 modified_by              | character varying(255)      |           |          
 segment_number           | integer                     |           |          

 -- <bunch of other text columns>

 submission_date          | character varying(255)      |           |          
 person_id                | character varying(255)      |           |          
 status                   | character varying(255)      |           |          
 file_id                  | bigint                      |           | not null 
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
    "my_table_file_idx" btree (file_id)
    "my_table_hplansubmitter_idx" btree (h_plan_submitter_id)
    "my_table_key_hash_idx" btree (key_hash)
    "my_table_person_id_idx" btree (person_id)
    "my_table_segment_number_idx" btree (segment_number)
Foreign-key constraints:
    "fk38njesaryvhj7e3p4thqkq7pb" FOREIGN KEY (h_plan_id) REFERENCES health_plan(id) ON UPDATE CASCADE ON DELETE CASCADE
    "fk6by9668sowmdob7433mi3rpsu" FOREIGN KEY (h_plan_submitter_id) REFERENCES h_plan_submitter(id) ON UPDATE CASCADE ON DELETE CASCADE
    "fkb06gpo9ng6eujkhnes0eco7bj" FOREIGN KEY (file_id) REFERENCES x12file(id) ON UPDATE CASCADE ON DELETE CASCADE

Additional information Possible values for type are EN and CR with EN being about 70% of the data. Table column widths select avg_width from pg_stats where tablename='mytable'; is a total of 374 for 41 columns so about 9 per col.

The idea is to show some pages upfront to user, they can then filter by additional parameters like file_name(each file usually has about 5k entries), type(very low cardinality), member_add_id(high cardinality), plan_id(low cardinality, every 500k to a million entries will be associated to a plan id). The business requirement in all cases is to show just the latest record for a certain set of plan id's for a submission_date(for reports it is done per year). The order by id was just defensive coding, the same day can have multiple entries and even if someone edited the second last entry hence touching the last_updated timestamp, we want to show only the very last entry of the same data. This probably never happens and can be removed.

User can use this data to generate csv reports.

Result of explain for the query with right join below:

 Nested Loop Left Join  (cost=554076.32..554076.56 rows=10 width=17092) (actual time=4530.914..4530.922 rows=10 loops=1)
   CTE cte
     ->  Unique  (cost=519813.11..522319.10 rows=495358 width=1922) (actual time=2719.093..3523.029 rows=422638 loops=1)
           ->  Sort  (cost=519813.11..521066.10 rows=501198 width=1922) (actual time=2719.091..3301.622 rows=423211 loops=1)
                 Sort Key: mytable.person_id, mytable.submission_date DESC NULLS LAST, mytable.last_updated DESC NULLS LAST, mytable.id DESC
                 Sort Method: external merge  Disk: 152384kB
                 ->  Seq Scan on mytable  (cost=0.00..54367.63 rows=501198 width=1922) (actual time=293.953..468.554 rows=423211 loops=1)
                       Filter: (((status)::text = ANY ('{ACCEPTED,CORRECTED}'::text[])) AND (h_plan_id = ANY ('{1,2}'::bigint[])) AND ((type)::text <> 'CR'::text))
                       Rows Removed by Filter: 10158
   ->  Aggregate  (cost=11145.56..11145.57 rows=1 width=8) (actual time=4142.116..4142.116 rows=1 loops=1)
         ->  CTE Scan on cte  (cost=0.00..9907.16 rows=495358 width=0) (actual time=2719.095..4071.481 rows=422638 loops=1)
   ->  Limit  (cost=20611.67..20611.69 rows=10 width=17084) (actual time=388.777..388.781 rows=10 loops=1)
         ->  Sort  (cost=20611.67..21850.06 rows=495358 width=17084) (actual time=388.776..388.777 rows=10 loops=1)
               Sort Key: cte_1.person_id
               Sort Method: top-N heapsort  Memory: 30kB
               ->  CTE Scan on cte cte_1  (cost=0.00..9907.16 rows=495358 width=17084) (actual time=0.013..128.314 rows=422638 loops=1)
 Planning Time: 0.369 ms
 JIT:
   Functions: 9
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.947 ms, Inlining 4.983 ms, Optimization 178.469 ms, Emission 110.261 ms, Total 295.660 ms
 Execution Time: 4587.711 ms

Upvotes: 5

Views: 10141

Answers (2)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656351

First things first: you can use results from a CTE multiple times in the same query, that's a main feature of CTEs.) What you have would work like this (while still using the CTE once only):

WITH cte AS (
   SELECT * FROM (
      SELECT *, row_number()  -- see below
                OVER (PARTITION BY person_id
                      ORDER BY submission_date DESC NULLS LAST  -- see below
                             , last_updated DESC NULLS LAST  -- see below
                             , id DESC) AS rn
      FROM  tbl
      ) sub
   WHERE  rn = 1
   AND    status IN ('ACCEPTED', 'CORRECTED')
   )
SELECT *, count(*) OVER () AS total_rows_in_cte
FROM   cte
LIMIT  10
OFFSET 0;  -- see below

Caveat 1: rank()

rank() can return multiple rows per person_id with rank = 1. DISTINCT ON (person_id) (like Gordon provided) is an applicable replacement for row_number() - which works for you, as additional info clarified. See:

Caveat 2: ORDER BY submission_date DESC

Neither submission_date nor last_updated are defined NOT NULL. Can be an issue with ORDER BY submission_date DESC, last_updated DESC ... See:

Should those columns really be NOT NULL?

You replied:

Yes, all those columns should be non-null. I can add that constraint. I put it as nullable since we get data in files which are not always perfect. But this is very rare condition and I can put in empty string instead.

Empty strings are not allowed for type date. Keep the columns nullable. NULL is the proper value for those cases. Use NULLS LAST as demonstrated to avoid NULL being sorted on top.

Caveat 3: OFFSET

If OFFSET is equal or greater than the number of rows returned by the CTE, you get no row, so also no total count. See:

Interim solution

Addressing all caveats so far, and based on added information, we might arrive at this query:

WITH cte AS (
   SELECT DISTINCT ON (person_id) *
   FROM   tbl
   WHERE  status IN ('ACCEPTED', 'CORRECTED')
   ORDER  BY person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC
   )
SELECT *
FROM  (
   TABLE  cte
   ORDER  BY person_id  -- ?? see below
   LIMIT  10
   OFFSET 0
   ) sub
RIGHT  JOIN (SELECT count(*) FROM cte) c(total_rows_in_cte) ON true;

Now the CTE is actually used twice. The RIGHT JOIN guarantees we get the total count, no matter the OFFSET. DISTINCT ON should perform OK-ish for the only few rows per (person_id) in the base query.

But you have wide rows. How wide on average? The query will likely result in a sequential scan on the whole table. Indexes won't help (much). All of this will remain hugely inefficient for paging. See:

You cannot involve an index for paging as that is based on the derived table from the CTE. And your actual sort criteria for paging is still unclear (ORDER BY id ?). If paging is the goal, you desperately need a different query style. If you are only interested in the first few pages, you need a different query style, yet. The best solution depends on information still missing in the question ...

Radically faster

For your updated objective:

Find latest entries for a person_id by submission_date

(Ignoring "for specified filter criteria, type, plan, status" for simplicity.)

And:

Find the latest row per person_id only if that has status IN ('ACCEPTED','CORRECTED')

Based on these two specialized indices:

CREATE INDEX ON tbl (submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST)
WHERE  status IN ('ACCEPTED', 'CORRECTED'); -- optional

CREATE INDEX ON tbl (person_id, submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST);

Run this query:

WITH RECURSIVE cte AS (
   (
   SELECT t  -- whole row
   FROM   tbl t
   WHERE  status IN ('ACCEPTED', 'CORRECTED')
   AND    NOT EXISTS (SELECT FROM tbl
                      WHERE  person_id = t.person_id 
                      AND   (  submission_date,   last_updated,   id)
                          > (t.submission_date, t.last_updated, t.id)  -- row-wise comparison
                      )
   ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
   LIMIT  1
   )

   UNION ALL
   SELECT (SELECT t1  -- whole row
           FROM   tbl t1
           WHERE ( t1.submission_date, t1.last_updated, t1.id)
               < ((t).submission_date,(t).last_updated,(t).id)  -- row-wise comparison
           AND    t1.status IN ('ACCEPTED', 'CORRECTED')
           AND    NOT EXISTS (SELECT FROM tbl
                              WHERE  person_id = t1.person_id 
                              AND   (   submission_date,    last_updated,    id)
                                  > (t1.submission_date, t1.last_updated, t1.id)  -- row-wise comparison
                              )
           ORDER  BY submission_date DESC NULLS LAST, last_updated DESC NULLS LAST, id DESC NULLS LAST
           LIMIT  1)
   FROM   cte c
   WHERE  (t).id IS NOT NULL
   )
SELECT (t).*
FROM   cte
LIMIT  10
OFFSET 0;

Every set of parentheses here is required.

This level of sophistication should retrieve a relatively small set of top rows radically faster by using the given indices and no sequential scan. See:

submission_date should most probably be type timestamptz or date, not character varying(255) - which is an odd type definition in Postgres in any case. See:

Many more details might be optimized, but this is getting out of hands. You might consider professional consulting.

Upvotes: 7

Gordon Linoff
Gordon Linoff

Reputation: 1269513

You might try simplifying the query:

SELECT DISTINCT ON (person_id) t.*
FROM my_table t
WHERE status in ('ACCEPTED', 'CORRECTED')
ORDER BY person_id, submission_date DESC, last_updated DESC, id DESC

I am not sure if Postgres is smart enough to use an index on (person_id, submission_date DESC, last_updated DESC, id DESC, status) in this case, but it is worth a try.

You could speed this using an index on an expression: (status in ('ACCEPTED', 'CORRECTED'), person_id, submission_date DESC, last_updated DESC, id DESC).

EDIT:

If you want to sort by another column, you can use a subquery:

SELECT t.*
FROM (SELECT DISTINCT ON (person_id) t.*
      FROM my_table t
      WHERE status in ('ACCEPTED', 'CORRECTED')
      ORDER BY person_id, submission_date DESC, last_updated DESC, id DESC
     ) t
ORDER BY submission_date DESC

Upvotes: 3

Related Questions