Reputation: 3774
I'm currently building a crawler. Multiple crawling workers access the same PostgreSQL database. Sadly I'm encountering issues with the main transaction presented here:
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE webpages
SET locked = TRUE
WHERE url IN
(
SELECT DISTINCT ON (source) url
FROM webpages
WHERE
(
last IS NULL
OR
last < refreshFrequency
)
AND
locked = FALSE
LIMIT limit
)
RETURNING *;
COMMIT;
url is a URL (String)
source is a domain name (String)
last is the last time a page was crawled (Date)
locked is a boolean that is set to indicate that a webpage is currently being crawled (Boolean)

I tried two different transaction isolation levels:

ISOLATION LEVEL SERIALIZABLE: I get errors like "could not serialize access due to concurrent update".
ISOLATION LEVEL READ COMMITTED: I get duplicate urls from concurrent transactions due to the data being "frozen" from the time the transaction was first committed (I think).

I'm fairly new to PostgreSQL and SQL in general, so I'm really not sure what I could do to fix this issue.
Update:
PostgreSQL version is 9.2.x.
webpages table definition:
CREATE TABLE webpages (
last timestamp with time zone,
locked boolean DEFAULT false,
url text NOT NULL,
source character varying(255) PRIMARY KEY
);
Upvotes: 2
Views: 4072
Reputation: 658582
The question leaves room for interpretation. This is how I understand the task:
Lock a maximum of limit URLs which fulfill some criteria and are not locked yet. To spread out the load on sources, every URL should come from a different source.
Assuming a separate table source: this makes the job faster and easier. If you don't have such a table, create it; it's the proper design anyway:
CREATE TABLE source (
source_id serial NOT NULL PRIMARY KEY
, source text NOT NULL
);
CREATE TABLE webpage (
   source_id int NOT NULL REFERENCES source
 , url text NOT NULL PRIMARY KEY
 , locked boolean NOT NULL DEFAULT false        -- may not be needed
 , last timestamp NOT NULL DEFAULT '-infinity'  -- makes query simpler
);
I am using advisory locks to make this safe and cheap even in the default read committed isolation level (alternatively, you can use a recursive CTE efficiently; see further down):
UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND    (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE
           AND    pg_try_advisory_xact_lock(hashtext(url))  -- only TRUE if free; hashtext() maps the text url to an int lock key
           LIMIT  1            -- get 1 URL per source
          ) AS url
   FROM  (
      SELECT source_id         -- the FK column in webpage
      FROM   source
      ORDER  BY random()
      LIMIT  limit             -- random selection of "limit" sources
      ) s
   FOR    UPDATE
   ) l
WHERE  w.url = l.url
RETURNING *;
Alternatively, you could work with advisory locks only and not use the table column locked at all: basically just run the SELECT statement. Locks are kept until the end of the transaction. You can use pg_try_advisory_lock() instead to keep the locks until the end of the session. Only UPDATE once to set last when done (and possibly release the advisory lock).
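A minimal sketch of that variant, under a few assumptions of mine: session-level locks via pg_try_advisory_lock(), hashtext() to map the text URL to an integer lock key, and a hypothetical example URL in the final step:

SELECT url
FROM  (
   SELECT (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND    (last >= refreshFrequency) IS NOT TRUE
           AND    pg_try_advisory_lock(hashtext(url))  -- session-level lock
           LIMIT  1
          ) AS url
   FROM  (
      SELECT source_id
      FROM   source
      ORDER  BY random()
      LIMIT  limit
      ) s
   ) sub
WHERE  url IS NOT NULL;

-- ... crawl the selected pages, no open transaction needed ...

-- per finished URL: record the crawl and release its advisory lock
UPDATE webpage SET last = now() WHERE url = 'http://example.com/page';      -- hypothetical URL
SELECT pg_advisory_unlock(hashtext('http://example.com/page'));

Note that the planner may call pg_try_advisory_lock() for candidate rows it later discards, which can leave stray session locks; the transaction-scoped variant above does not have this problem because everything is released at COMMIT.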
In Postgres 9.3 or later you would use a LATERAL
join instead of the correlated subquery.
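For illustration, a sketch of how such a LATERAL variant could look (not from the original answer; same placeholders refreshFrequency and limit as above, and hashtext() as lock key is my assumption):

UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT p.url
   FROM  (
      SELECT source_id
      FROM   source
      ORDER  BY random()
      LIMIT  limit
      ) s
   CROSS  JOIN LATERAL (
      SELECT url
      FROM   webpage
      WHERE  source_id = s.source_id
      AND    (last >= refreshFrequency) IS NOT TRUE
      AND    locked = FALSE
      AND    pg_try_advisory_xact_lock(hashtext(url))
      LIMIT  1
      ) p
   ) l
WHERE  w.url = l.url
RETURNING *;

Sources that currently have no free URL simply drop out of the CROSS JOIN LATERAL, which also removes the NULL rows the correlated subquery would otherwise produce.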
I chose pg_try_advisory_xact_lock() because the lock can (and should) be released at the end of the transaction. The Postgres manual has a detailed explanation of advisory locks.
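To illustrate the try-lock semantics (the key 42 is just an arbitrary example):

BEGIN;
SELECT pg_try_advisory_xact_lock(42);  -- returns true: this session holds the lock now
-- A concurrent session running the same call gets false immediately
-- instead of waiting, so that worker simply skips the row.
COMMIT;  -- the transaction-scoped lock is released automatically here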
You get fewer than limit rows if some sources have no more URLs to crawl.
The random selection of sources is my wild but educated guess, since that information is not available. If your source table is big, there are faster ways to pick a random selection of rows.
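One common shortcut, sketched under the assumption that source_id is a serial with few gaps; oversampling compensates for gaps and duplicates:

SELECT s.*
FROM  (
   SELECT DISTINCT trunc(random() * (SELECT max(source_id) FROM source))::int + 1 AS source_id
   FROM   generate_series(1, limit * 2)   -- oversample; "limit" is the placeholder from above
   ) r
JOIN   source s USING (source_id)
LIMIT  limit;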
refreshFrequency should really be called something like latest_last, since it's not a "frequency", but a timestamp or date.
To get the full limit number of rows if available, use a RECURSIVE CTE and iterate over all sources until you have found enough or no more can be found.
As I mentioned above, you may not need the column locked at all and can operate with advisory locks only (cheaper). Just set last at the end of the transaction, before you start the next round.
WITH RECURSIVE s AS (
   SELECT source_id, row_number() OVER (ORDER BY random()) AS rn
   FROM   source                  -- you might exclude "empty" sources early ...
   )
, page(source_id, rn, ct, url) AS (
   SELECT 0, 0, 0, ''::text       -- dummy init row
   UNION ALL
   SELECT s.source_id, s.rn
        , CASE WHEN p.url <> ''
               THEN p.ct + 1
               ELSE p.ct END      -- only increment if a url was found last round
        , (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND    (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE                           -- may not be needed
           AND    pg_try_advisory_xact_lock(hashtext(url)) -- only TRUE if free
           LIMIT  1               -- get 1 URL per source
          ) AS url                -- try, may come up empty
   FROM   page p
   JOIN   s ON s.rn = p.rn + 1
   WHERE  CASE WHEN p.url <> ''
               THEN p.ct + 1
               ELSE p.ct END < limit   -- your limit here
   )
SELECT url
FROM   page
WHERE  url <> '';                 -- excludes '' and NULL
Alternatively, if you need to manage locked
, too, use this query with the above UPDATE
.
You will love SKIP LOCKED in the upcoming Postgres 9.5.
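As a forward-looking sketch (my assumption of how it could look, not a query from the original answer), plain row locks with SKIP LOCKED could then replace the advisory locks in the per-source subquery:

UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT p.url
   FROM  (
      SELECT source_id
      FROM   source
      ORDER  BY random()
      LIMIT  limit
      ) s
   CROSS  JOIN LATERAL (
      SELECT url
      FROM   webpage
      WHERE  source_id = s.source_id
      AND    (last >= refreshFrequency) IS NOT TRUE
      AND    locked = FALSE
      LIMIT  1
      FOR    UPDATE SKIP LOCKED   -- rows locked by other workers are simply skipped
      ) p
   ) l
WHERE  w.url = l.url
RETURNING *;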
Upvotes: 3
Reputation: 36127
First, try:
UPDATE webpages
SET locked = TRUE
WHERE url IN
(
SELECT DISTINCT ON (source) url
FROM webpages
WHERE
(
last IS NULL
OR
last < refreshFrequency
)
AND
locked = FALSE
LIMIT limit
)
AND
(
last IS NULL
OR
last < refreshFrequency
)
AND
locked = FALSE
You are trying to update only records with locked = FALSE.
Imagine that there are the following records in the table:
URL   locked
------------
A     false
A     true
The subquery in your update will return A.
Then the outer update will perform:
UPDATE webpages
SET locked = TRUE
WHERE url IN ( 'A' )
and in effect all records in the table containing url = A will be updated, regardless of their values in the locked column.
You need to apply the same WHERE condition to the outer update as in the subquery.
Upvotes: 0