Heinrisch
Heinrisch

Reputation: 5935

Fetching rows not updated within last 24 hours

I have a large table (40+ million records) with a structure like the following:

CREATE TABLE collected_data(
    id TEXT NOT NULL,
    status TEXT NOT NULL,
    PRIMARY KEY(id, status),
    blob JSONB,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

I need to get all (or atleast 100,000) records that have a updated_at older than 24 hours, of a certain status, and have a blob that is not null.

So the query becomes:

SELECT
    id
FROM
    collected_data
WHERE
    status = 'waiting'
    AND blob IS NOT NULL
    AND updated_at < NOW() - '24 hours'::interval
LIMIT 100000;

Which results in the execution plan of something like:

Limit  (cost=0.00..234040.07 rows=100000 width=12)
  ->  Seq Scan on collected_data  (cost=0.00..59236150.00 rows=25310265 width=12)
"        Filter: ((blob IS NOT NULL) AND (type = 'waiting'::text) AND (updated_at >= (now() - '24:00:00'::interval)))"

It almost always results in a full table scan, which mean that some queries are really slow.

I have tried to create indexes like CREATE INDEX idx_special ON collected_data(status, updated_at); but it does not help.

Is there any way I can make this query faster?

Upvotes: 1

Views: 70

Answers (1)

jjanes
jjanes

Reputation: 44192

The planner thinks that 25,310,265 rows will meet your conditions, so it thinks it will be spoiled for choice in getting a mere 100,000 of them by a seq scan and then stopping early. If there isn't really that many, or there are that many but they are all clustered in the wrong part of the table, this won't actually be so fast. This is especially likely to be the case if, after selecting 100,000 of them, the next thing you do is update them in a way such they no longer meet the criteria. Because then you have to keep walking past the accumulating remnants of the ones that used to qualify, to find the next batch.

You can encourage it to use the index by adding 'order by updated_at' to your query. You could also stack the deck in your favor by creating a partial index CREATE INDEX ON collected_data(status, updated_at) where blob is not null or maybe CREATE INDEX ON collected_data(updated_at) where status='waiting' and blob is not null.

Upvotes: 1

Related Questions