Reputation: 2934
How do relational databases maintain your set of results if another query has edited those rows that you were working on?
For instance if I select
100k rows and while I am still fetching those another query does an update
on a row that hasn't been read yet I will not see the update
.
Upvotes: -1
Views: 223
Reputation: 161
Snapshot isolation solves the problem you are describing. If you use locking, you can see the same record twice as the iterator through the database, as the unlocked records change underneath your feet as you're doing the scan.
Read committed isolation level with locking suffers from this problem.
Depending on the granularity of the lock, the WHERE predicate may lock matching pages and tuples for locking so that the running read query doesn't see phantom data appearing (phantom reads)
I implemented multiversion concurrency control in my Java project. A transaction is given a monotonically increasing timestamp which starts at 0 and goes up by 1 each time the transaction is aborted. Later transactions have higher timestamps. When a transaction goes to read, it can only see data that has a timestamp that is less than or equal to itself and is committed for that key (or column of that tuple). (Equal to so it can see its own writes)
When a transaction writes, it updates the committed timestamp for that key to that of that transactions timestamp.
Upvotes: 0
Reputation: 15158
The general goal you are describing in concurrent programming (Wikipedia concurrency control) is serialization (Wikipedia serializability): an implementation manages the database as if transactions occurred without overlap in some order.
The importance of that is that only then does the system act in a way described by the code as we normally interpret it. Otherwise the results of operations are a combination of all processes acting concurrently. Nevertheless by having limited categories of non-normal non-isolated so-called anomalous behaviours arise transaction throughput can be increased. So those implementation techniques are also apropos. (Eg MVCC.) But understand: such non-serialized behaviour is not isolating one transaction from another. (Ie so-called "isolation" levels are actually non-isolation levels.)
Isolation is managed by breaking transaction implementations into pieces based on reading and writing shared resources and executing them interlaced with pieces from other transactions in such a way that the effect is the same as some sequence of non-overlapped execution. Roughly speaking, one can "pessimistically" "lock" out other processes from changed resources and have them wait or "optimistically" "version" the database and "roll back" (throw away) some processes' work when changes are unreconcilable (unserializable).
Some techniques based on an understanding of serializability by an implementer for a major product are in this answer. For relevant notions and techniques, see the Wikipedia articles or a database textbook. Eg Fundamentals of database systems by Ramez Elmasri & Shamkant B. Navathe. (Many textbooks, slides and courses are free online.)
(Two answers and a comment to your question mention MVCC. As I commented, not only is MVCC just one implementation technique, it doesn't even support transaction serialization, ie actually isolating transactions as if each was done all at once. It allows certain kinds of errors (aka anomalies). It must be mixed with other techniques for isolation. The MVCC answers, comments and upvoting reflects a confusion between a popular and valuable technique for a useful and limited failure to isolate per your question vs the actual core issues and means.)
Upvotes: 1
Reputation: 324891
As Jayadevan notes, the general principle used by most widely used databases that permit you to modify values while they're being read is multi-version concurrency control or MVCC. All widely used modern RDBMS implementations that support reading rows that're being updated rely on some concept of row versioning.
The details differ between implementations. (I'll explain PostgreSQL's a little more here, but you should accept Jayadevan's answer not mine).
PostgreSQL uses transaction ID ranges to track row visibility. So there'll be multiple copies of a tuple in a table, but any given transaction can only "see" one of them. Each transaction has a unique ID, with newer transactions having newer IDs. Each tuple has hidden xmin
and xmax
fields that track which transactions can "see" the tuple. Insertion is implimented by setting the tuple's xmin
so that transactions with lower xids know to ignore the tuple when reading the table. Deletion is implimented by setting the tuple's xmax
so that transactions with higher xids know to ignore the tuple when reading the table. Updates are implemented by effectively deleting the old tuple (setting xmax
) then inserting a new one (setting xmin
) so that old transactions still see the old tuple, but new transactions see the new tuple. When no running transaction can still see a deleted tuple, vacuum
marks it as free space to be overwritten.
Oracle uses undo and redo logs, so there's only one copy of the tuple in the main table, and transactions that want to find old versions have to go and find it in the logs. Like PostgreSQL it uses row versioning, though I'm less sure of the details.
Pretty much every other DB uses a similar approach these days. Those that used to rely on locking, like older MS-SQL versions, have moved to MVCC.
MySQL uses MVCC with InnoDB tables, which are now the default. MyISAM tables still rely on table locking (but they'll also eat your data, so don't use them for anything you care about).
A few embedded DBs, like SQLite, still rely only on table locking - which tends to require less wasted disk space and I/O overhead, at the cost of greatly reduced concurrency. Some databases let you bypass MVCC if you take an exclusive lock on a table.
(Marked community wiki, since I also close-voted this question).
You should also read the PostgreSQL docs on transaction isolation and locking, and similar documentation for other DBs you use. See the Wikipedia article on isolation.
Upvotes: 0
Reputation: 1342
Please lookup Multi Version Concurrency Control. Different databases have different approaches to managing this. For MySQL, InnoDB, you can try http://dev.mysql.com/doc/refman/5.0/en/innodb-multi-versioning.html. PostgreSQL - https://wiki.postgresql.org/wiki/MVCC. A great presentation here - http://momjian.us/main/writings/pgsql/mvcc.pdf. It is explained in stackoverflow in this thread Database: What is Multiversion Concurrency Control (MVCC) and who supports it?
Upvotes: 3