Reputation:
I'm working on PostgreSQL with PHP.
Is it possible to select a particular number of random values directly with something like
select column FROM table WHERE condition
instead of selecting all matching values with
select column FROM table WHERE condition
and then converting them into a PHP array and using array_rand()
?
(I do not want to use that approach because I will have millions of rows; selecting all values first and then calling array_rand() is probably going to take a lot of time.)
Let's say I have a table like this:
name   | items
-------+------------
Ben    | {dd,ab,aa}
David  | {dd,aa}
Bryan  | {aa,ab,cd}
Glen   | {cd}
Edward | {aa,cd}
Aaron  | {dd,aa}
...... (many many more)
And I need to select 10 random values from a column (or, basically, 10 random rows) that match a condition (in this case items @> ARRAY['aa']
) without a sequential table scan or anything similarly time-consuming.
order by random()
is going to take a lot of time as it has to process every matching row, so I would rather go with a better solution.
Upvotes: 3
Views: 1307
Reputation: 659247
This solution is good if the base table is not heavily updated; otherwise maintenance costs may exceed the gain.
If your condition is always items @> ARRAY['aa']
, you could create an auxiliary look-up table (basically a materialized view).
CREATE TABLE tbl_pick (rn serial, id int, PRIMARY KEY (rn, id));

INSERT INTO tbl_pick (id)
SELECT id FROM tbl
WHERE  items @> ARRAY['aa'];
Then you can apply a similar method as described here:
SELECT *
FROM (
SELECT 1 + floor(random() * <insert_count_plus_a_bit_here>)::integer AS rn
FROM generate_series(1, 20) g
GROUP BY 1 -- trim duplicates
) r
JOIN tbl_pick USING (rn)
JOIN tbl USING (id)
LIMIT 10; -- trim surplus
This should be very fast, because it only needs index scans and only ~ 10 rows are read from the table(s).
Of course, you have to update tbl_pick
after (relevant) INSERT / DELETE / UPDATE to tbl
.
Small numbers of changes can just be added to / deleted from tbl_pick (no UPDATEs), because there is some wiggle room in the method. After a certain number of changes you would TRUNCATE
tbl_pick and rerun the complete INSERT
. Basically, rewrite your materialized view.
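A refresh could be sketched like this (using the table and condition from the example above; note that the picking query above relies on rn starting at 1, which is why the serial sequence is reset too):

```sql
BEGIN;
-- RESTART IDENTITY resets the sequence behind the serial column rn,
-- so numbering starts at 1 again after the rewrite.
TRUNCATE tbl_pick RESTART IDENTITY;

INSERT INTO tbl_pick (id)
SELECT id FROM tbl
WHERE  items @> ARRAY['aa'];
COMMIT;
```

Running it in one transaction means concurrent readers never see the table half-filled.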
UPDATEs and DELETEs on tbl could be cascaded to tbl_pick
with foreign key constraints (ON UPDATE CASCADE ON DELETE CASCADE
), and an AFTER INSERT
trigger would take care of newly inserted rows. It all depends on what is possible in the base table.
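That wiring might look roughly like this (a sketch: it assumes tbl.id is the primary key of tbl, and reuses the example condition):

```sql
-- Cascade UPDATEs / DELETEs from tbl to tbl_pick.
ALTER TABLE tbl_pick
  ADD CONSTRAINT tbl_pick_id_fk FOREIGN KEY (id) REFERENCES tbl (id)
  ON UPDATE CASCADE ON DELETE CASCADE;

-- Add newly inserted matching rows to the look-up table.
CREATE OR REPLACE FUNCTION trg_tbl_pick_ins()
  RETURNS trigger AS
$func$
BEGIN
   IF NEW.items @> ARRAY['aa'] THEN
      INSERT INTO tbl_pick (id) VALUES (NEW.id);
   END IF;
   RETURN NULL;  -- return value is ignored in an AFTER trigger
END
$func$ LANGUAGE plpgsql;

CREATE TRIGGER tbl_pick_ins
AFTER INSERT ON tbl
FOR EACH ROW EXECUTE PROCEDURE trg_tbl_pick_ins();
```

An UPDATE that makes a row stop matching the condition is not covered by this; that is part of the wiggle room the scheduled rewrite cleans up.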
And schedule a complete rewrite of tbl_pick at regular intervals, preferably during off hours.
If your random select queries come in bursts, it might be cheaper to have a "variable" indicating whether tbl_pick
is dirty (instead of fk constraints and trigger) and only refill the table in such a case before you run your query (multiple times). This very much depends on your use patterns. This "variable" could be a one-row table, where only UPDATE
is allowed. Set it to TRUE after (relevant) updates to tbl
, to FALSE after you have refreshed the materialized view.
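A minimal sketch of that one-row flag table (the names are made up for illustration):

```sql
-- One-row flag table; after setup, only UPDATE is allowed.
CREATE TABLE tbl_pick_dirty (dirty bool NOT NULL);
INSERT INTO tbl_pick_dirty VALUES (true);

-- After relevant changes to tbl:
UPDATE tbl_pick_dirty SET dirty = true;

-- Before a burst of random-pick queries: check the flag, refresh
-- tbl_pick if it is true, then clear it:
UPDATE tbl_pick_dirty SET dirty = false;
```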
Upvotes: 1
Reputation:
In PostgreSQL, you can order by random()
, so this should be what you want:
select name
from   tbl
where  items @> ARRAY['aa']
order  by random()
limit  10;
Upvotes: 6