Adnan Mohib
Adnan Mohib

Reputation: 369

Evenly select Records on categorical column with Repeating Sequence and pagination in Postgres

Database: Postgres
I have a product(id, title, source, ...) table which contains almost 500K records. An example of data is:

| Id | title    | source   |
|:---|---------:|:--------:|
| 1  | product1 | source1  |
| 2  | product2 | source1  |
| 3  | product3 | source1  |
| 4  | product4 | source1  |
| .  | ........ | source1  |
| .  | ........ | source2  |
| x  | productx | source2  |
|x+n |productX+n| sourceN  |

There are are 5 distinct source values. And all records have source values random.

I need to get paginated results in such a way that: If I need to select 20 products then the results set should contain results equally distributed based on source and should be in a repeating sequence. 2 products from each source till the last source and again next 2 products from each source. For example:

| #  | title    | source   |
|:---|---------:|:--------:|
| 1  | product1 | source1  |
| 2  | product2 | source1  |
| 3  | product3 | source2  |
| 4  | product4 | source2  |
| 5  | product5 | source3  |
| 6  | product6 | source3  |
| 7  | product7 | source4  |
| 8  | product8 | source4  |
| 9  | product9 | source5  |
| 10 |product10 | source5  |
| 11 | ........ | source1  |
| 12 | ........ | source1  |
| 13 | ........ | source2  |
| 14 | ........ | source2  |
| .. | ........ | .......  |
| 20 | ........ | source5  |

What is the optimized PgSql query to achieve the above scenario considering the LIMIT, OFFSET, sources can increase or decrease?

EDIT
As Suggested by George S, the below solution works, however, it is less performant. it takes almost 6 seconds to select only 20 records.

select id, title, source
, (row_number() over(partition by source order by last_modified DESC) - 1) / 2 as ordinal 
   -- order here can be by created time, id, title, etc
from product p
order by ordinal, source
limit 20
offset 2;

Explain ANALYZE of above query on real data

Limit  (cost=147621.60..147621.65 rows=20 width=92) (actual time=5956.126..5956.138 rows=20 loops=1)
  ->  Sort  (cost=147621.60..148813.72 rows=476848 width=92) (actual time=5956.123..5956.128 rows=22 loops=1)
        Sort Key: (((row_number() OVER (?) - 1) / 2)), provider
        Sort Method: top-N heapsort  Memory: 28kB
        ->  WindowAgg  (cost=122683.80..134605.00 rows=476848 width=92) (actual time=5099.059..5772.821 rows=477731 loops=1)
              ->  Sort  (cost=122683.80..123875.92 rows=476848 width=84) (actual time=5098.873..5347.858 rows=477731 loops=1)
                    Sort Key: provider, last_modified DESC
                    Sort Method: external merge  Disk: 46328kB
                    ->  Seq Scan on product p  (cost=0.00..54889.48 rows=476848 width=84) (actual time=0.012..4360.000 rows=477731 loops=1)
Planning Time: 0.354 ms
Execution Time: 5961.670 ms

Upvotes: 2

Views: 331

Answers (1)

George S
George S

Reputation: 2151

This can be accomplished easily with a window function:

select id, title, source
, (row_number() over(partition by source order by id) - 1) / 2 as ordinal 
   --ordering here can be by created time, id, title, etc
from product p
order by ordinal, source
limit 10
offset 2;

As you noted, depending on your table size and other filters being used this may or may not be performant. The best way to tell is to run an EXPLAIN ANALYZE with the query on your actual data. If this isn't performant, you can also add the ordinal field to the table itself if it will always be the same value / ordering. Sadly, you can't create an index using a window function (at least not in PG12).

If you don't want to change the table itself, you can create a materialized view and then query that view so that the calculation only has to be done once:

CREATE MATERIALIZED VIEW ordered_product AS
  select id, title, source
  , (row_number() over(partition by source order by id) - 1) / 2 as ordinal 
   from product;

Afterwards, you can query the view like a normal table:

select * from ordered_product order by ordinal, source limit 10 offset 20;

You can also create indexes for it if necessary. Note that to refresh the view you'd run a command like:

REFRESH MATERIALIZED VIEW ordered_product;

Upvotes: 3

Related Questions