zenocon

Reputation: 1657

PL/pgSQL function to randomly select an id

Goal:

  1. Pre-populate a table with a list of sequential ids, e.g. from 1 to 1,000,000. The table has an additional column that is nullable; rows where it is NULL are considered unassigned and rows where it is non-NULL are considered assigned.
  2. Have a function I can call that asks for x number of randomly chosen ids from the table which have not yet been assigned.

This is for something quite specific, and while I understand there are different ways of doing this, I'd like to know if there's a solution to the flaw in this particular implementation.

I have something that partially works, but I'm wondering where the flaw in the function is.

Here's the table:

CREATE SEQUENCE accounts_seq MINVALUE 700000000001 NO MAXVALUE;

CREATE TABLE accounts (
  id BIGINT PRIMARY KEY default nextval('accounts_seq'), 
  client VARCHAR(25), UNIQUE(id, client)
);

This function gen_account_ids is just a one-time setup to pre-populate the table with a fixed number of rows, all marked as unassigned.

/*
  This function will insert new rows into the accounts table with ids being
  generated by a sequence, and client being NULL.  A NULL client indicates
  the account has not yet been assigned.
*/
CREATE OR REPLACE FUNCTION gen_account_ids(bigint)
  RETURNS INT AS $gen_account_ids$
DECLARE
  -- count is the number of new accounts you want generated
  count alias for $1;
  -- rowcount is returned as the number of rows inserted
  rowcount int;
BEGIN
  INSERT INTO accounts(client) SELECT NULL FROM generate_series(1, count);
  GET DIAGNOSTICS rowcount = ROW_COUNT;
  RETURN rowcount;
END;
$gen_account_ids$ LANGUAGE plpgsql;

So, I use this to pre-populate the table with, say 1000 records:

SELECT gen_account_ids(1000);

The next function, assign, is meant to randomly select an unassigned id (unassigned meaning the client column is null) and update it with a client value so that it becomes assigned. It returns the number of rows affected.

It works sometimes, but I believe collisions are occurring, which is why I added DISTINCT; even so, it often returns fewer than the desired number of rows. For example, if I run SELECT assign(100, 'foo'); it might return 95 rows instead of the desired 100.

How can I modify this so it always assigns (and returns) exactly the desired number of rows?

/*
  This will assign ids to a client randomly.
  @param int is the number of account numbers to generate
  @param varchar(10) is a string descriptor for the client
  @returns the number of rows affected -- should be the same as the input int

  Call it like this: `SELECT * FROM assign(100, 'FOO')`
*/
CREATE OR REPLACE FUNCTION assign(INT, VARCHAR(10))
  RETURNS INT AS $$
DECLARE
  total ALIAS FOR $1;
  clientname ALIAS FOR $2;
  rowcount int;
BEGIN
  UPDATE accounts SET client = clientname WHERE id IN (
    SELECT DISTINCT trunc(random() * (
      (SELECT max(id) FROM accounts WHERE client IS NULL) -
      (SELECT min(id) FROM accounts WHERE client IS NULL)) +
      (SELECT min(id) FROM accounts WHERE client IS NULL))
    FROM generate_series(1, total));
  GET DIAGNOSTICS rowcount = ROW_COUNT;
  RETURN rowcount;
END;
$$ LANGUAGE plpgsql;

This is loosely based on an example where you can do something like SELECT trunc(random() * (100 - 1) + 1) FROM generate_series(1, 5); which selects 5 random numbers between 1 and 100.

My goal is to do something similar: select a random id between the min and max unassigned ids, and mark that row for update.
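
To illustrate what I think is happening, here's a quick check in the spirit of the generate_series example above (the exact numbers will vary from run to run): drawing random values this way produces duplicates, which DISTINCT then collapses.

-- Draw 100 random values and count how many are distinct;
-- the distinct count typically falls well short of 100.
SELECT count(*) AS generated,
       count(DISTINCT v) AS distinct_values
FROM (
  SELECT trunc(random() * (100 - 1) + 1)::int AS v
  FROM generate_series(1, 100)
) t;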

Upvotes: 2

Views: 146

Answers (2)

klin

Reputation: 121624

Because the ids of the unassigned rows are not consecutive (they form an arbitrary subset), select a random row_number() instead of a random id.

with nulls as ( -- base query
    select id
    from accounts 
    where client is null
    ),
randoms as ( -- calculate random int in range 1..count(nulls.*) 
    select trunc(random()* (count(*) - 1) + 1)::int random_value
    from nulls
    ),
row_numbers as ( -- add row numbers to nulls
    select id, row_number() over (order by id) rn
    from nulls
    )
select id
from row_numbers, randoms
where rn = random_value; -- random row number

A function is not necessary here, but you can easily place the query in a function body if needed.
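
For example, a minimal sketch of such a wrapper (the function name pick_random_unassigned_id is just illustrative):

-- Illustrative wrapper around the query above; returns the id of one
-- randomly chosen unassigned row.
CREATE OR REPLACE FUNCTION pick_random_unassigned_id()
    RETURNS bigint AS $$
    with nulls as ( -- base query
        select id
        from accounts
        where client is null
        ),
    randoms as ( -- calculate random int in range 1..count(nulls.*)
        select trunc(random()* (count(*) - 1) + 1)::int random_value
        from nulls
        ),
    row_numbers as ( -- add row numbers to nulls
        select id, row_number() over (order by id) rn
        from nulls
        )
    select id
    from row_numbers, randoms
    where rn = random_value;
$$ LANGUAGE sql;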

This query updates 5 random rows with null client.

update accounts
set client = 'new value' -- <-- clientname
where id in (
    with nulls as ( -- base query
        select id
        from accounts 
        where client is null
        ),
    randoms as ( -- calculate random int in range 1..count(nulls.*) 
        select i, trunc(random()* (count(*) - 1) + 1)::int random_value
        from nulls
        cross join generate_series(1, 5) i -- <--  total
        group by 1
        ),
    row_numbers as ( -- add row numbers to nulls in order by id
        select id, row_number() over (order by id) rn
        from nulls
        )
    select id
    from row_numbers, randoms
    where rn = random_value -- random row number
);

However, there is no certainty that the query will update exactly 5 rows, because

select trunc(random()* (max_value - 1) + 1)::int
from generate_series(1, n)

is not a correct way to generate n different random values. The probability of repetitions increases with the quotient n / max_value.
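
For example, drawing n values uniformly from m possible values leaves only about m * (1 - (1 - 1/m)^n) distinct values in expectation, so 100 draws over a range of roughly 100 ids yield only about 63 distinct values on average.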

Upvotes: 1

zenocon

Reputation: 1657

This isn't the best answer because it involves full table scans, but in my situation I'm not concerned about performance, and it works. This is based on @CraigRinger's reference to the blog post on getting random tuples.

I'd be generally interested in hearing about other (perhaps better) solutions, and I'm specifically curious about why the original solution falls short, as well as about what @klin devised.

So, here's my brute force random order solution:

-- generate a million unassigned rows with null client column
insert into accounts(client) select null from generate_series(1, 1000000);

-- assign 1000 random rows to client 'foo'
update accounts set client = 'foo' where id in 
  (select id from accounts where client is null order by random() limit 1000);
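
If you still want this wrapped in a function with the same signature as the original assign, a minimal sketch (just the question's function with the ORDER BY random() query swapped in) might look like:

-- Sketch of assign() rewritten around ORDER BY random() ... LIMIT;
-- same parameters and return value as the function in the question.
CREATE OR REPLACE FUNCTION assign(INT, VARCHAR(10))
  RETURNS INT AS $$
DECLARE
  total ALIAS FOR $1;
  clientname ALIAS FOR $2;
  rowcount int;
BEGIN
  UPDATE accounts SET client = clientname WHERE id IN (
    SELECT id FROM accounts WHERE client IS NULL
    ORDER BY random() LIMIT total);
  GET DIAGNOSTICS rowcount = ROW_COUNT;
  RETURN rowcount;
END;
$$ LANGUAGE plpgsql;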

Upvotes: 2
