Reputation: 1410
(Sorry in advance for the long question. I tried to break it up into sections to make it clearer what I'm asking. Please let me know if I should add anything else or reorganize it at all.)
Background:
I'm writing a web crawler that uses a producer/consumer model with jobs (pages to crawl or re-crawl) stored in a postgresql database table called crawler_table. I'm using SQLAlchemy to access and make changes to the database table. The exact schema is not important for this question. The important thing is that I (will) have multiple consumers, each of which repeatedly selects a record from the table, loads the page with phantomjs, and then writes information about the page back to the record.
It can happen on occasion that two consumers select the same job. This is not itself a problem; however, it is important that if they update the record with their results simultaneously, that they make consistent changes. It's good enough for me to just find out if an update would cause the record to become inconsistent. If so, I can deal with it.
Investigation:
I initially assumed that if two transactions in separate sessions read then updated the same record simultaneously, the second one to commit would fail. To test that assumption, I ran the following code (simplified slightly):
SQLAlchemySession = sessionmaker(bind=create_engine(my_postgresql_uri))
class Session (object):
# A simple wrapper for use with `with` statement
def __enter__ (self):
self.session = SQLAlchemySession()
return self.session
def __exit__ (self, exc_type, exc_val, exc_tb):
if exc_type:
self.session.rollback()
else:
self.session.commit()
self.session.close()
with Session() as session: # Create a record to play with
if session.query(CrawlerPage) \
.filter(CrawlerPage.url == 'url').count() == 0:
session.add(CrawlerPage(website='website', url='url',
first_seen=datetime.utcnow()))
page = session.query(CrawlerPage) \
.filter(CrawlerPage.url == 'url') \
.one()
page.failed_count = 0
# commit
# Actual experiment:
with Session() as session:
page = session.query(CrawlerPage) \
.filter(CrawlerPage.url == 'url') \
.one()
print 'initial (session)', page.failed_count
# 0 (expected)
page.failed_count += 5
with Session() as other_session:
same_page = other_session.query(CrawlerPage) \
.filter(CrawlerPage.url == 'url') \
.one()
print 'initial (other_session)', same_page.failed_count
# 0 (expected)
same_page.failed_count += 10
print 'final (other_session)', same_page.failed_count
# 10 (expected)
# commit other_session, no errors (expected)
print 'final (session)', page.failed_count
# 5 (expected)
# commit session, no errors (why?)
with Session() as session:
page = session.query(CrawlerPage) \
.filter(CrawlerPage.url == 'url') \
.one()
print 'final value', page.failed_count
# 5 (expected, given that there were no errors)
(Apparently Incorrect) Expectations:
I would have expected that reading a value from a record then updating that value within the same transaction would:
Questions:
Upvotes: 2
Views: 1722
Reputation: 95722
PostgreSQL has select . . . for update
, which SQLAlchemy seems to support.
My rationale is that all transactions should behave as though they are performed independently in order of commit whenever possible, or should fail to commit.
Well, in general there's a lot more to transactions than that. PostgreSQL's default transaction isolation level is "read committed". Loosely speaking, that means multiple transactions can simultaneously read committed values from the same rows in a table. If you want to prevent that, set transaction isolation serializable
(might not work), or select...for update
, or lock the table, or use a column-by-column WHERE clause, or whatever.
You can test and demonstrate transaction behavior by opening two psql connections.
begin transaction; begin transaction; select * from test where pid = 1 and date = '2014-10-01' for update; (1 row) select * from test where pid = 1 and date = '2014-10-01' for update; (waiting) update test set date = '2014-10-31' where pid = 1 and date = '2014-10-01'; commit; -- Locks released. SELECT for update fails. (0 rows)
Upvotes: 1