Reputation: 1739
The following code generates 100,000 rows, with random values for the Id column produced by uuid_generate_v4(). However, the nested selects always choose the same row, so all the inserted rows end up with the same values for those columns. The goal is to create a table with 100k rows whose values are taken at random from the other sample tables. Each of the sample tables has only two columns (Id and the column from which the values are taken). How can this be achieved?
insert into "Tag" (
"Id", "Time", "Account", "Name", "Value", "RollUpTableId"
)
select
uuid_generate_v4(),
current_timestamp,
(select "Account" from "AccountSamples" OFFSET floor(random()*358) LIMIT 1),
(select "Name" from "TagNameSamples" OFFSET floor(random()*19) LIMIT 1),
(select "Value" from "TagValueSamples" OFFSET floor(random()*26) LIMIT 1),
uuid_generate_v4()
from generate_series(1, 100000);
I've also tried select "Account" from "AccountSamples" where "Id" = (trunc(random() * 358)::integer)
Upvotes: 0
Views: 1313
Reputation: 1739
For anyone who runs into something similar, I'm sharing an answer by pehrs from Reddit that solves the problem.
In your first solution, the problem is that the subselect does not have any external dependency, so it is optimized to call random() only once. You can fix this by adding an external dependency. Try something like this:
select
    uuid_generate_v4(),
    current_timestamp,
    (select "Account" from "AccountSamples" WHERE gen=gen OFFSET floor(random()*358) LIMIT 1),
    (select "Name" from "TagNameSamples" WHERE gen=gen OFFSET floor(random()*19) LIMIT 1),
    (select "Value" from "TagValueSamples" WHERE gen=gen OFFSET floor(random()*26) LIMIT 1),
    uuid_generate_v4()
from generate_series(1, 100000) gen;
By the way, the typical way to pick randomly from a table without having to precalculate the size of the table is something like
SELECT foo FROM bar ORDER BY random() LIMIT 1
It is not overly performant, but it is simple and well understood.
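Applied to the original insert, the two ideas can be combined. The following is only a sketch of my own (not part of the Reddit answer, untested): it keeps the gen=gen reference so the subselects are re-evaluated for every generated row, and uses ORDER BY random() so the sample table sizes no longer need to be hard-coded.
insert into "Tag" (
    "Id", "Time", "Account", "Name", "Value", "RollUpTableId"
)
select
    uuid_generate_v4(),
    current_timestamp,
    -- the gen=gen reference is the external dependency that forces per-row re-evaluation
    (select "Account" from "AccountSamples" WHERE gen=gen ORDER BY random() LIMIT 1),
    (select "Name" from "TagNameSamples" WHERE gen=gen ORDER BY random() LIMIT 1),
    (select "Value" from "TagValueSamples" WHERE gen=gen ORDER BY random() LIMIT 1),
    uuid_generate_v4()
from generate_series(1, 100000) gen;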
Upvotes: 0
Reputation: 222482
Likely, Postgres is optimizing the subqueries and does not re-execute them for each row.
I would recommend numbering the rows randomly in subqueries, then joining on that row number:
select uuid_generate_v4(), a."Account", tns."Name", tvs."Value"
from (
    select "Account", row_number() over(order by random()) rn from "AccountSamples"
) a
inner join (
    select "Name", row_number() over(order by random()) rn from "TagNameSamples"
) tns on tns.rn = a.rn
inner join (
    select "Value", row_number() over(order by random()) rn from "TagValueSamples"
) tvs on tvs.rn = a.rn
where a.rn <= 10
This is not exactly the same logic as your original query, because a given row can be selected only once, but I think that's a reasonable approximation.
If some of your tables may have fewer than 10 rows, then generate_series() and left joins are safer:
select uuid_generate_v4(), a."Account", tns."Name", tvs."Value"
from generate_series(1, 10) x(rn)
left join (
    select "Account", row_number() over(order by random()) rn from "AccountSamples"
) a on a.rn = x.rn
left join (
    select "Name", row_number() over(order by random()) rn from "TagNameSamples"
) tns on tns.rn = x.rn
left join (
    select "Value", row_number() over(order by random()) rn from "TagValueSamples"
) tvs on tvs.rn = x.rn
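Wired back into the original insert, this might look like the sketch below (my adaptation, not part of the answer). The series length of 19 is an assumption based on the question's smallest sample table, "TagNameSamples"; a longer series would leave NULLs in the columns coming from the smaller tables.
insert into "Tag" (
    "Id", "Time", "Account", "Name", "Value", "RollUpTableId"
)
select uuid_generate_v4(), current_timestamp, a."Account", tns."Name", tvs."Value", uuid_generate_v4()
-- 19 rows assumed here so every left join finds a match; adjust to your table sizes
from generate_series(1, 19) x(rn)
left join (
    select "Account", row_number() over(order by random()) rn from "AccountSamples"
) a on a.rn = x.rn
left join (
    select "Name", row_number() over(order by random()) rn from "TagNameSamples"
) tns on tns.rn = x.rn
left join (
    select "Value", row_number() over(order by random()) rn from "TagValueSamples"
) tvs on tvs.rn = x.rn;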
Upvotes: 2