Sara Fuerst

Reputation: 6088

Optimization of selection of semi-random rows in Postgres

I currently have a query that randomly selects a job from a table of jobs:

select jobs.job_id
from jobs
where (jobs.type is null)
  and (jobs.project_id = 5)
  and (jobs.status = 'Available')
offset floor(random() * (select count(*) from jobs
                         where (jobs.type is null) and (jobs.project_id = 5)
                           and (jobs.status = 'Available')))
limit 1

This has the desired functionality, but it is too slow. I am using Postgres 9.2, so I can't use TABLESAMPLE, unfortunately.

On the plus side, I do not need it to be truly random, so I'm thinking I can optimize it by making it slightly less random.

Any ideas?

Upvotes: 0

Views: 105

Answers (3)

wildplasser

Reputation: 44250

A dirty trick: store a random value inside the table and build a (partial) index on it. (You may want to re-randomise this field from time to time, so that some records are not left unpicked forever; see the sketch after the code below. ;-)

-- assumed table definition
CREATE table jobs
        ( job_id SERIAL NOT NULL PRIMARY KEY
        , type VARCHAR
        , project_id INTEGER NOT NULL
        , status VARCHAR NOT NULL
        -- pre-computed magic random number
        -- , magic DOUBLE PRECISION NOT NULL DEFAULT random()
        );

-- some data to play with
INSERT INTO jobs(type,project_id,status)
SELECT 'aaa' , gs %10 , 'Omg!'
FROM generate_series(1,10000) gs;
UPDATE jobs SET type = NULL WHERE random() < 0.2;
UPDATE jobs SET status = 'Available'  WHERE random() < 0.2;

-- add a column containing random numbers
ALTER TABLE jobs
    ADD column magic DOUBLE PRECISION NOT NULL DEFAULT random()
    ;

CREATE INDEX ON jobs(magic)
    -- index is only applied for the conditions you will be searching
    WHERE status = 'Available' AND project_id = 5 AND type IS NULL
    ;

-- make sure statistics are present
VACUUM ANALYZE jobs;

-- EXPLAIN
SELECT j.job_id
FROM jobs j
WHERE j.type is null
  AND j.project_id = 5
  AND j.status = 'Available'
ORDER BY j.magic
LIMIT 1
        ;
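
A minimal sketch of the periodic re-randomisation mentioned above (run it occasionally, e.g. from a cron job; restricting it to the rows covered by the partial index keeps the update small):

-- refresh the magic values, so rows that happened to draw a large
-- number are not starved forever
UPDATE jobs
SET magic = random()
WHERE status = 'Available' AND project_id = 5 AND type IS NULL;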

Something similar can be accomplished by using a serial with a rather high increment value (some prime number around 3G) instead of a random float.
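
One way to realise that idea without calling random() at all is a multiplicative hash of the serial key. This is a sketch under my own assumptions, not the exact sequence recipe: the column name magic2 is illustrative, and it uses Knuth's multiplicative-hash prime 2654435761 with a modulus of 2^32.

-- derive a scattered but deterministic ordering from the serial key
-- (new rows would need a trigger or application-side fill; omitted here)
ALTER TABLE jobs
    ADD COLUMN magic2 BIGINT;

UPDATE jobs
SET magic2 = (job_id * 2654435761) % 4294967296;

CREATE INDEX ON jobs(magic2)
    WHERE status = 'Available' AND project_id = 5 AND type IS NULL;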

Upvotes: 0

Gordon Linoff

Reputation: 1270421

Could I suggest an index on jobs(project_id, status, type)? That might speed up your query, if such an index is not already defined on the table.
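
In DDL form (a sketch; the index name is my own choice):

-- matches the WHERE clause: equality columns first, then the IS NULL column
CREATE INDEX jobs_project_status_type_idx
    ON jobs (project_id, status, type);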

Upvotes: 1

Laurenz Albe

Reputation: 247235

Instead of using OFFSET and LIMIT, why don't you use

ORDER BY random() LIMIT 1
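
Spelled out against the query from the question, that is:

SELECT jobs.job_id
FROM jobs
WHERE jobs.type IS NULL
  AND jobs.project_id = 5
  AND jobs.status = 'Available'
ORDER BY random()
LIMIT 1;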

If that is also too slow, you could replace your subquery

select count(*) from jobs
   where (jobs.type is null) and (jobs.project_id = 5)
      and (jobs.status = 'Available')

with something like

SELECT reltuples * <factor> FROM pg_class WHERE oid = <tableoid>

where <tableoid> is the OID of the jobs table and <factor> is a number that is slightly bigger than the selectivity of the WHERE condition of your subquery.

That will save one sequential scan, with the downside that you occasionally get no result and have to repeat the query.
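
Put together, the query could look like this sketch (the factor 0.25 is a placeholder you would tune to your data; the regclass cast is one way to obtain the table's OID):

SELECT jobs.job_id
FROM jobs
WHERE jobs.type IS NULL
  AND jobs.project_id = 5
  AND jobs.status = 'Available'
OFFSET floor(random() *
             (SELECT reltuples * 0.25        -- <factor>, slightly above the selectivity
              FROM pg_class
              WHERE oid = 'jobs'::regclass)) -- <tableoid> resolved via regclass
LIMIT 1;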

Is that good enough?

Upvotes: 0
