Reputation: 1657
Goal: select x randomly chosen ids from the table which have not been assigned. This is for something quite specific, and while I understand there are different ways of doing this, I'd like to know if there's a solution to the flaw in this particular implementation.
I have something that partially works, but I'm wondering where the flaw in the function is.
Here's the table:
CREATE SEQUENCE accounts_seq MINVALUE 700000000001 NO MAXVALUE;
CREATE TABLE accounts (
    id BIGINT PRIMARY KEY DEFAULT nextval('accounts_seq'),
    client VARCHAR(25),
    UNIQUE(id, client)
);
This function, gen_account_ids, is just a one-time setup to pre-populate the table with a fixed number of rows, all marked as unassigned.
/*
This function will insert new rows into the accounts table with ids being
generated by a sequence, and client being NULL. A NULL client indicates
the account has not yet been assigned.
*/
CREATE OR REPLACE FUNCTION gen_account_ids(bigint)
RETURNS INT AS $gen_account_ids$
DECLARE
    -- count is the number of new accounts you want generated
    count ALIAS FOR $1;
    -- rowcount is returned as the number of rows inserted
    rowcount int;
BEGIN
    INSERT INTO accounts(client) SELECT NULL FROM generate_series(1, count);
    GET DIAGNOSTICS rowcount = ROW_COUNT;
    RETURN rowcount;
END;
$gen_account_ids$ LANGUAGE plpgsql;
So, I use this to pre-populate the table with, say, 1000 records:
SELECT gen_account_ids(1000);
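A quick way to confirm the setup worked is to count the unassigned rows; assuming the table started empty, this should come back as 1000:
-- every pre-populated row should still have a NULL client at this point
SELECT count(*) FROM accounts WHERE client IS NULL;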
The next function, assign, is meant to randomly select an unassigned id (unassigned means the client column is null) and update it with a client value so it becomes assigned. It returns the number of rows affected.
It works sometimes, but I believe there are collisions occurring -- which is why I tried DISTINCT -- and it often returns fewer than the desired number of rows. For example, SELECT assign(100, 'foo'); might return 95 rows instead of the desired 100.
How can I modify this to make it always return the exact desired rows?
/*
This will assign ids to a client randomly
@param int is the number of account numbers to generate
@param varchar(10) is a string descriptor for the client
@returns the number of rows affected -- should be the same as the input int
Call it like this: `SELECT * FROM assign(100, 'FOO')`
*/
CREATE OR REPLACE FUNCTION assign(INT, VARCHAR(10))
RETURNS INT AS $$
DECLARE
    total ALIAS FOR $1;
    clientname ALIAS FOR $2;
    rowcount int;
BEGIN
    UPDATE accounts SET client = clientname WHERE id IN (
        SELECT DISTINCT trunc(random() * (
            (SELECT max(id) FROM accounts WHERE client IS NULL) -
            (SELECT min(id) FROM accounts WHERE client IS NULL)) +
            (SELECT min(id) FROM accounts WHERE client IS NULL))
        FROM generate_series(1, total));
    GET DIAGNOSTICS rowcount = ROW_COUNT;
    RETURN rowcount;
END;
$$ LANGUAGE plpgsql;
This is loosely based on this approach, where you can do something like SELECT trunc(random() * (100 - 1) + 1) FROM generate_series(1, 5); which will select 5 random numbers between 1 and 100.
My goal is to do something similar: select a random id between the min and max id of the unassigned rows, and mark it for update.
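For what it's worth, the collision effect is easy to reproduce in isolation, separate from the accounts table: drawing 100 values from a range of roughly 1000 usually yields fewer than 100 distinct results.
-- draw 100 random integers from 1..999 and count how many distinct values survive;
-- this typically comes back in the mid-90s rather than 100
SELECT count(DISTINCT trunc(random() * 999 + 1)::int) FROM generate_series(1, 100);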
Upvotes: 2
Views: 146
Reputation: 121624
Because the ids of a random subset of rows are not consecutive, select a random row_number() instead of a random id.
with nulls as ( -- base query
    select id
    from accounts
    where client is null
),
randoms as ( -- calculate random int in range 1..count(nulls.*)
    select trunc(random() * (count(*) - 1) + 1)::int random_value
    from nulls
),
row_numbers as ( -- add row numbers to nulls
    select id, row_number() over (order by id) rn
    from nulls
)
select id
from row_numbers, randoms
where rn = random_value; -- random row number
A function is not necessary here, but you can easily place the query in a function body if needed.
This query updates 5 random rows with a null client.
update accounts
set client = 'new value' -- <-- clientname
where id in (
    with nulls as ( -- base query
        select id
        from accounts
        where client is null
    ),
    randoms as ( -- calculate random int in range 1..count(nulls.*)
        select i, trunc(random() * (count(*) - 1) + 1)::int random_value
        from nulls
        cross join generate_series(1, 5) i -- <-- total
        group by 1
    ),
    row_numbers as ( -- add row numbers to nulls in order by id
        select id, row_number() over (order by id) rn
        from nulls
    )
    select id
    from row_numbers, randoms
    where rn = random_value -- random row number
);
However, there is no certainty that the query will update exactly 5 rows, because
select trunc(random() * (max_value - 1) + 1)::int
from generate_series(1, n)
is not a correct way to generate n different random values. The probability of repetitions increases with the quotient n / max_value.
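As noted above, a function is not strictly needed, but if you do want an interface like the original assign, the update can be dropped into a plpgsql body as-is; a rough sketch (the function and parameter names here are just illustrative), with the same caveat about repetitions:
create or replace function assign_random(total int, clientname varchar)
returns int as $$
declare
    rowcount int;
begin
    update accounts
    set client = clientname
    where id in (
        with nulls as ( -- unassigned rows
            select id from accounts where client is null
        ),
        randoms as ( -- total random row numbers (may repeat)
            select i, trunc(random() * (count(*) - 1) + 1)::int random_value
            from nulls
            cross join generate_series(1, total) i
            group by 1
        ),
        row_numbers as ( -- number the unassigned rows by id
            select id, row_number() over (order by id) rn
            from nulls
        )
        select id
        from row_numbers, randoms
        where rn = random_value
    );
    get diagnostics rowcount = row_count;
    return rowcount;
end;
$$ language plpgsql;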
Upvotes: 1
Reputation: 1657
This isn't the best answer because it involves full table scans, but in my situation I'm not concerned about the performance, and it works. This is based on @CraigRinger's reference to the blog post on getting random tuples.
I'd generally be interested in hearing about other (perhaps better) solutions, and am specifically curious about why the original solution falls short, as well as about what @klin devised.
So, here's my brute force random order solution:
-- generate a million unassigned rows with null client column
insert into accounts(client) select null from generate_series(1, 1000000);
-- assign 1000 random rows to client 'foo'
update accounts set client = 'foo' where id in
(select id from accounts where client is null order by random() limit 1000);
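And if you'd rather keep the original assign(int, varchar) interface, the same ORDER BY random() LIMIT query drops straight into the earlier function body; a sketch:
CREATE OR REPLACE FUNCTION assign(INT, VARCHAR(10))
RETURNS INT AS $$
DECLARE
    total ALIAS FOR $1;
    clientname ALIAS FOR $2;
    rowcount int;
BEGIN
    -- picks up to `total` unassigned rows in random order
    -- (exactly `total` as long as that many unassigned rows remain)
    UPDATE accounts SET client = clientname WHERE id IN
        (SELECT id FROM accounts WHERE client IS NULL ORDER BY random() LIMIT total);
    GET DIAGNOSTICS rowcount = ROW_COUNT;
    RETURN rowcount;
END;
$$ LANGUAGE plpgsql;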
Upvotes: 2