user1282226

Reputation:

PostgreSQL: Random Selection?

I'm working on PostgreSQL with PHP.

Is it possible to select a particular number of random values from a column FROM table WHERE condition,

instead of running

select column FROM table WHERE condition

and then converting the results into an array and using array_rand()?

(I do not want to do it this way because I will have millions of rows, and selecting all the values first and then running array_rand() on them would probably take a long time.)


Let's say I have a table like this:

   name   |   items
----------+------------
 Ben      | {dd,ab,aa}
 David    | {dd,aa}
 Bryan    | {aa,ab,cd}
 Glen     | {cd}
 Edward   | {aa,cd}
 Aaron    | {dd,aa}
 ...        (many many more)

Updates:

And I need to select 10 random values from a column (or, basically, 10 random rows) that match a condition (in this case items @> ARRAY['aa']), without a sequential table scan or anything similarly time-consuming.

order by random() is going to take a lot of time, as it has to deal with every matching row, so I would rather go with a better solution.
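(For the condition itself: a GIN index on the array column lets @> be evaluated without a sequential scan. A minimal sketch, assuming the table is named tbl; the index name is made up:

CREATE INDEX tbl_items_gin_idx ON tbl USING gin (items);

That still leaves the "pick 10 at random" part, which is what this question is about.)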

Upvotes: 3

Views: 1307

Answers (2)

Erwin Brandstetter

Reputation: 659247

This solution is good if the base table is not heavily updated. Else maintenance costs may exceed the gain.

If your condition is always @> ARRAY['aa'], you could create an auxiliary look-up table (basically a materialized view).

CREATE TABLE tbl_pick (rn serial, id int, PRIMARY KEY (rn, id));

INSERT INTO tbl_pick (id)
SELECT id FROM tbl
WHERE  items @> ARRAY['aa'];

Then you can apply a method similar to the one described here:

SELECT *
FROM  (
    SELECT 1 + floor(random() * <insert_count_plus_a_bit_here>)::integer AS rn
    FROM   generate_series(1, 20) g
    GROUP  BY 1                     -- trim duplicates
    ) r
JOIN   tbl_pick USING (rn)
JOIN   tbl USING (id)
LIMIT  10;                          -- trim surplus

This should be very fast, because it only needs index scans and only ~ 10 rows are read from the table(s).
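As for the <insert_count_plus_a_bit_here> placeholder, a minimal sketch of how you might obtain it: since deleted rows leave gaps in rn, the maximum rn (not count(*)) is the relevant upper bound, and you add a bit of padding on top.

SELECT max(rn) FROM tbl_pick;   -- cheap: reads one end of the pkey index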

Of course, you have to update tbl_pick after (relevant) INSERT / DELETE / UPDATE on tbl. Small numbers of changes can just be added to / deleted from tbl_pick (no in-place updates), because there is some wiggle room in the method. After a certain number of changes you would TRUNCATE and rerun the complete INSERT; basically, rewrite your materialized view.
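A minimal sketch of such a complete rewrite, assuming the sequence created by the serial column is still attached to rn:

BEGIN;
TRUNCATE tbl_pick RESTART IDENTITY;   -- also resets the rn sequence
INSERT INTO tbl_pick (id)
SELECT id FROM tbl
WHERE  items @> ARRAY['aa'];
COMMIT;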

UPDATEs and DELETEs can be cascaded to tbl_pick with foreign key constraints (ON UPDATE CASCADE ON DELETE CASCADE), plus an AFTER INSERT trigger for newly inserted rows, as sketched below. It all depends on what is possible in the base table.
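A minimal sketch of that wiring, assuming tbl has a primary key id; the constraint, function and trigger names are made up:

ALTER TABLE tbl_pick
   ADD CONSTRAINT tbl_pick_id_fkey
   FOREIGN KEY (id) REFERENCES tbl (id)
   ON UPDATE CASCADE ON DELETE CASCADE;

CREATE FUNCTION tbl_pick_ins() RETURNS trigger AS
$$
BEGIN
   IF NEW.items @> ARRAY['aa'] THEN          -- only rows matching the condition
      INSERT INTO tbl_pick (id) VALUES (NEW.id);
   END IF;
   RETURN NULL;                              -- return value ignored for AFTER triggers
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER tbl_pick_ins
AFTER INSERT ON tbl
FOR EACH ROW EXECUTE PROCEDURE tbl_pick_ins();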

And schedule a complete rewrite of tbl_pick at regular intervals, preferably during off hours.

If your random-select queries come in bursts, it might be cheaper to keep a "variable" indicating whether tbl_pick is dirty (instead of fk constraints and triggers), and only refill the table in that case, before you run your query (multiple times). This very much depends on your usage patterns. The "variable" could be a one-row table where only UPDATE is allowed: set it to TRUE after (relevant) updates to tbl, and to FALSE after you have refreshed the materialized view.
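A minimal sketch of that flag, with made-up names:

CREATE TABLE tbl_pick_dirty (dirty boolean NOT NULL);
INSERT INTO tbl_pick_dirty VALUES (false);

-- after (relevant) changes to tbl:
UPDATE tbl_pick_dirty SET dirty = true;

-- before a burst of random picks: if dirty, rewrite tbl_pick as above, then:
UPDATE tbl_pick_dirty SET dirty = false;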

Upvotes: 1

user554546

Reputation:

In PostgreSQL, you can order by random(), so this should be what you want:

select name
from table
where items @> ARRAY['aa']
order by random()
limit 10;

Upvotes: 6
