rdegges

Reputation: 33824

Redshift: Serializable isolation violation on table

I have a very large Redshift database that contains billions of rows of HTTP request data.

I have a table called requests which has a few important fields: ip_address, ip_city, ip_state, and ip_country.

I have a Python process that runs once per day, grabs all distinct IP addresses that have not yet been geocoded (have no city / state / country information), and then attempts to geocode each one via Google's Geocoding API.

This process (pseudocode) looks like this:

for ip_address in ips_to_geocode:
    country, state, city = geocode_ip_address(ip_address)
    # run each UPDATE in its own transaction, passing the values
    # as query parameters
    execute_transaction('''
        UPDATE requests
        SET ip_country = %s, ip_state = %s, ip_city = %s
        WHERE ip_address = %s
    ''', (country, state, city, ip_address))

When running this code, I often receive errors like the following:

psycopg2.InternalError: 1023
DETAIL:  Serializable isolation violation on table - 108263, transactions forming the cycle are: 647671, 647682 (pid:23880)

I'm assuming this is because I have other processes constantly logging HTTP requests into my table, so when I attempt to execute my UPDATE statement, it cannot consistently select all rows with the IP address I'd like to update.

My question is this: what can I do to update these records in a sane way that will stop failing regularly?

Upvotes: 19

Views: 31240

Answers (4)

Dhwani Katagade

Reputation: 1292

Since your geocode update process does point updates while other processes are writing to the same table, you can intermittently get the serializable isolation violation error, depending on how and when the other processes write to the same table.

Suggestions

  • One way is to use a table lock, as Marcus Vinicius Melo suggested in his answer.
  • Another approach is to catch the error and retry the transaction.

For any serializable transaction, the code that initiates it should be prepared to retry it in the face of this error. Since all transactions in Redshift run at serializable isolation, all code initiating transactions against Redshift should be ready to retry them, as in the sketch below.
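A minimal retry sketch, assuming a psycopg2 connection conn; the exception class and error text are taken from the traceback in the question, while run_with_retry and max_retries are hypothetical names:

import time
import psycopg2

def run_with_retry(conn, sql, params, max_retries=5):
    """Run one statement, retrying when Redshift aborts the transaction
    with a serializable isolation violation (error 1023)."""
    for attempt in range(max_retries):
        try:
            with conn.cursor() as cur:
                cur.execute(sql, params)
            conn.commit()
            return
        except psycopg2.InternalError as exc:
            conn.rollback()  # the aborted transaction must be rolled back
            if 'Serializable isolation violation' not in str(exc):
                raise  # a different internal error; do not mask it
            time.sleep(2 ** attempt)  # back off so the conflicting writer can finish
    raise RuntimeError('statement still conflicting after %d retries' % max_retries)

Calling run_with_retry with the UPDATE statement and parameter tuple would replace the execute_transaction call in the question's loop.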

Explanations

The typical cause of this error is that two transactions started and proceeded in such a way that at least one of them cannot complete as if the two had executed one after the other. The database system therefore chooses to abort one of them by throwing this error, which hands control back to the code that initiated the transaction so it can take an appropriate course of action, retrying being one of them.

One way to prevent such a conflicting sequence of operations is to use a lock, but that also blocks many concurrent executions that would not have produced a conflict. The lock guarantees the error will not occur, at the cost of concurrency; the retry approach lets concurrency have its chance and handles the case when a conflict does occur.

Recommendation

That said, I would still recommend that you don't update Redshift with point updates like this. The geocode update process should write to a staging table and, once all records are processed, perform one single bulk update, followed by a vacuum if required, as sketched below.
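A sketch of that staging pattern, assuming a hypothetical staging table named requests_geo_staging with the same geo columns as requests (the table name and row shape are assumptions; only requests and its columns come from the question):

def apply_geocodes(conn, rows):
    """rows: iterable of (ip_address, ip_country, ip_state, ip_city) tuples."""
    with conn.cursor() as cur:
        # 1) load the freshly geocoded rows into the staging table
        #    (for large batches a COPY from S3 would be faster)
        cur.executemany(
            'INSERT INTO requests_geo_staging '
            '(ip_address, ip_country, ip_state, ip_city) '
            'VALUES (%s, %s, %s, %s)', rows)
        # 2) one bulk, join-based update instead of many point updates
        cur.execute('''
            UPDATE requests
            SET ip_country = s.ip_country,
                ip_state = s.ip_state,
                ip_city = s.ip_city
            FROM requests_geo_staging s
            WHERE requests.ip_address = s.ip_address
        ''')
    conn.commit()
    with conn.cursor() as cur:
        cur.execute('TRUNCATE requests_geo_staging')  # TRUNCATE commits implicitly in Redshift

Run VACUUM requests afterwards on an autocommit connection if needed, since VACUUM cannot run inside a transaction block.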

Upvotes: 4

Marcus Vinicius Melo

Reputation: 1158

I just got the same issue in my code, and this is how I fixed it:

First things first: this error code means you are trying to do conflicting concurrent operations in Redshift. Issuing a second query against a table before an earlier query on it has finished, for example, is one case where you would get this kind of error (that was my case).

Good news: there is a simple way to serialize Redshift operations, the LOCK command. Here is the Amazon documentation for the Redshift LOCK command: http://docs.aws.amazon.com/redshift/latest/dg/r_LOCK.html. It works by making the next operation wait until the previous transaction is closed. Note that using this command will naturally make your script a little slower.

In the end, the practical solution for me was to insert the LOCK command before the query statements (in the same string, separated by a ';'). Something like this:

LOCK table_name; SELECT * from ...

And you should be good to go! I hope it helps you.
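Applied to the UPDATE from the question, a sketch of this might look like the following (a plain psycopg2 version of the asker's loop body; the function name is hypothetical):

import psycopg2

def geocode_update(conn, ip_address, country, state, city):
    with conn.cursor() as cur:
        # LOCK takes an exclusive lock on requests that is held until
        # the transaction commits, so concurrent writers wait here
        cur.execute('LOCK requests')
        cur.execute('''
            UPDATE requests
            SET ip_country = %s, ip_state = %s, ip_city = %s
            WHERE ip_address = %s
        ''', (country, state, city, ip_address))
    conn.commit()  # releases the lock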

Upvotes: 8

Rahul Gupta

Reputation: 1802

Either start a new session when you do the second update on the same table, or commit once your transaction is complete.

You can set autocommit=on before you start updating.
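In psycopg2 that is a connection attribute; a minimal sketch (the connection parameters and values are placeholders):

import psycopg2

conn = psycopg2.connect(host='my-cluster.example.com', port=5439,
                        dbname='mydb', user='me', password='secret')
# with autocommit on, each UPDATE runs in its own short transaction,
# so no long-lived open transaction is left around to conflict with
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute('UPDATE requests SET ip_country = %s WHERE ip_address = %s',
                ('US', '203.0.113.7'))  # illustrative values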

Upvotes: 1

Paladin

Reputation: 630

Your code is violating the serializable isolation level of Redshift. You need to make sure that your code is not trying to open multiple transactions on the same table before closing all open transactions.

You can achieve this by locking the table in each transaction so that no other transaction can update it until the open transaction is closed. Not sure how your code is architected (synchronous or asynchronous), but this will increase the run time, as each lock forces the others to wait until the transaction completes.

Refer: http://docs.aws.amazon.com/redshift/latest/dg/r_LOCK.html

Upvotes: 12
