Reputation: 243
Note: I tried searching for other questions and answers that address this, but I couldn't find one that really corresponded to my problem and presented a full solution.
I'm trying to generate random, synthetic data to test my database schema, using SQL. While generating a bunch of random values in PostgreSQL is easy enough with random(), generating random data sets that preserve the constraints and characteristics of the data I'm expecting to see is not. Specifically, I have the following tables:
CREATE TABLE suites(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);
INSERT INTO suites(name)
SELECT 'suite' || g FROM generate_series(1,50) g;
CREATE TABLE tests(
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name TEXT
);
INSERT INTO tests(name)
SELECT 'test' || g FROM generate_series(1,100) g;
CREATE TABLE tests_in_suites(
suite_id BIGINT,
test_id BIGINT,
PRIMARY KEY (suite_id, test_id)
);
I want to populate tests_in_suites with random values, such that each suite contains a random number (between 3 and 7) of tests, selected uniformly at random from tests, avoiding cycles and other repeating patterns. I have tried the following approach:
SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5)) t
ORDER BY s.id, t.id;
But it always chooses the same number of tests for each suite, and the tests selected are identical, because the optimiser evaluates the uncorrelated subquery for t only once and reuses the result as a constant. I tried introducing a dependency on the current suite being considered, but it complains that the value I'm trying to use is not accessible:
SELECT s.id, t.id FROM
(select id from suites) s,
(SELECT id FROM tests ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
ORDER BY s.id, t.id;
ERROR: invalid reference to FROM-clause entry for table "s"
LINE 3: ...s ORDER BY random() LIMIT 2 + ceil(random() * 5 + s.id*0)) t
^
HINT: There is an entry for table "s", but it cannot be referenced from this part of the query.
How can I generate my random data without falling victim to either the optimiser or invalid data dependencies in my query?
Upvotes: 2
Views: 2391
Reputation: 11581
I want to populate tests_in_suites with random values, such that each suite contains a random number (between 3 and 7) of tests, selected uniformly from tests
This sounds like a nice use case for a lateral join...
INSERT INTO tests_in_suites(suite_id,test_id)
SELECT suites.id, t.id
FROM suites
CROSS JOIN LATERAL (SELECT id, suites.id AS lol FROM tests ORDER BY random() LIMIT (3+random()*4)) t;
A lateral join recomputes the joined subquery for each row of the table on the left side of the join, which is what we want here. However, Postgres will optimise the recomputation away if the subquery turns out not to depend on the outer row after all. So I added a reference to suites.id inside the subquery to make it genuinely dependent on the current row of suites.
There's probably a way to do it with array_agg() and unnest() too.
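That array_agg()/unnest() variant might look something like the untested sketch below: shuffle all test ids into an array once per suite, keep a random 3-7 element slice of it, and unnest the slice back into rows. The WHERE suites.id IS NOT NULL clause is the same fake-dependency trick as above, and the slice bound expression is my own assumption about how to pick 3-7 elements, so treat this as a sketch rather than a verified solution.

```sql
INSERT INTO tests_in_suites(suite_id, test_id)
SELECT suites.id, t.id
FROM suites
CROSS JOIN LATERAL (
  -- shuffle all test ids into an array, then keep a random slice of
  -- 3 to 7 elements and turn it back into rows with unnest()
  SELECT unnest(
           (array_agg(tests.id ORDER BY random()))[1:(3 + floor(random() * 5))::int]
         ) AS id
  FROM tests
  WHERE suites.id IS NOT NULL  -- fake dependency, same trick as above
) t;
```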
Upvotes: 4
Reputation: 243
The solution I found was inspired by several recipes I've seen online (especially the use of row_number() to get a random selection of rows), but it contains my own insight and I haven't seen this exact approach used anywhere.
The key idea is to break the difficult task of generating random rows into a series of easier tasks, where at each step I only generate random integers. To generate the rows, I use a recursive CTE, and finally a JOIN on a window function (row_number()) to combine the rows into my result table.
The solution below has been tested on PostgreSQL 10 and 12, but it should work on any version which supports recursive CTEs and window functions. It should also be easy to adapt to any other RDBMS out there which supports those.
-- For each suite, map a random number (between 3 and 7) of tests.
-- Because it's difficult to join a random number of rows
-- between two tables in SQL without violating data dependency rules
-- and/or having the optimiser lift it out into a constant, repeating
-- pattern, instead we do it in several steps:
--
-- * For each suite ID, generate a random number between 3 and 7
-- representing the number of tests we want to include
-- * Then, using a recursive CTE, for each suite ID generate rows,
-- each with a random integer no larger than the number of
-- tests. Limit the number of rows to the small integer generated in
-- the previous step
-- * Join the table generated in the above CTE with tests on row
-- number, using the random int generated as the row number to
-- pick. This gives us a table containing three values: suite_id,
-- test_id, random row number. By extracting only the IDs, we have
-- now generated the values to insert into tests_in_suites
INSERT INTO tests_in_suites
-- "+ id*0" serves to ensure the optimiser sees a dependency on the
-- current row and doesn't lift the random() out as a constant
WITH s(id, n_tests) AS (SELECT id, 2 + ceil(random() * 5) + id*0 FROM suites),
cnt AS (SELECT COUNT(*) FROM tests),
t AS (SELECT id, row_number() over (ORDER BY random()) AS rn FROM tests),
sr AS (SELECT * FROM
(WITH RECURSIVE subtests(sid, n, rn) AS (
SELECT s.id, n_tests + 1, NULL::bigint FROM s
UNION
SELECT sid, n - 1, ceil(random() * (SELECT * FROM cnt))::bigint
FROM subtests
WHERE n > 1)
SELECT * FROM subtests) x
WHERE rn IS NOT NULL
ORDER BY sid)
SELECT sid, t.id FROM sr JOIN t USING(rn)
ORDER BY sid, t.id
-- The above will generate a few duplicates. They're not a
-- big deal, so just skip them
ON CONFLICT DO NOTHING;
SELECT seen, total, seen / total::double precision as "fraction used" FROM
(SELECT count(*) AS seen FROM (SELECT DISTINCT test_id FROM tests_in_suites) t) x,
(SELECT count(*) AS total FROM tests) y;
SELECT suite_id, count(suite_id) FROM tests_in_suites GROUP BY suite_id;
SELECT * FROM tests_in_suites;
Upvotes: 1