MMacphail
MMacphail

Reputation: 561

How can I prevent my SQL subquery from caching data in PostgreSQL?

I'm trying to insert test data to test postgresql partionning, using Postgresql 11.

Here is my table:

CREATE TABLE measurement (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
);

Here is my statement to insert test data:

explain insert into measurement(
  city_id,
  logdate,
  peaktemp,
  unitsales
)
select
  ( 
    select (random() * 10000)::int + (generator*0) as city_id 
  ),
  ( 
    select 
        * 
    from 
        (select * from generate_series('2006-01-01'::timestamp,
                                       '2006-12-31'::timestamp,
                                       '1 day')) as rng
    order by
        random()
    limit 1
  ) as logdate,
  ( 
    select (random() * 10000)::int + (generator*0) as peaktemp 
  ),
  ( 
    select (random() * 10000)::int + (generator*0) as unitsales 
  )
from generate_series(1, 1000) as generator

The problem is: when I run this statement, the select subquery which generates the field logdate seems to be cached and the table measurement contains a single value for the field logdate.

So, if I run the following query:

select distinct(logdate)
from measurement

I only get a single value: 2006-02-10

How can I prevent the caching side-effect?

Upvotes: 1

Views: 525

Answers (2)

richyen
richyen

Reputation: 9968

As you've observed, the ORDER BY is calculated once for the sub-query, such that for each row that's generated in your outer query, the selected row is going to be the same. As such, you won't be able to use sub-selects and ORDER BY to achieve what you're aiming for, even though you call random().

Try this:

SELECT
  (random() * 10000)::int AS city_id,
  '2006-01-01'::timestamp + random() * ('2006-12-31'::timestamp - '2006-01-01'::timestamp) AS logdate,
  (random() * 10000)::int AS peaktemp,
  (random() * 10000)::int AS unitsales
FROM
generate_series(1,1000);

I'm not sure what your requirements are, but I don't think you need to have 1) a bunch of sub-selects or 2) all the references to generator*0, so I've removed them too.

Disclosure: I work for EnterpriseDB (EDB)

Upvotes: 3

Tometzky
Tometzky

Reputation: 23880

Maybe something like this:

select
  (random() * 10000)::int as city_id,
  (
    '2006-01-01'::timestamptz
    +random()*(
      ('2006-01-01'::timestamptz+'1 year'::interval)
      -'2006-01-01'::timestamptz
    )
  ) as logdate,
  (random() * 10000)::int as peaktemp,
  (random() * 10000)::int as unitsales
from generate_series(1, 1000) as generator;

Although I'm confused whether you want a timestamptz (you definitely don't want timestamp, without timezone, as this type is evil) or a date in logdate.

Upvotes: 1

Related Questions