roarster
roarster

Reputation: 4086

Redshift Concurrent Transactions

I'm having issues concurrently writing to a Redshift db. Writing to the db using a single connection works well, but is a little slow, so I am trying to use multiple concurrent connections but it looks like there can only be a single transaction at a time.

I investigated by running the following python script alone, and then running it 4 times simultaneously.

import psycopg2
import time
import os

if __name__ == "__main__":
  rds_conn = psycopg2.connect(host="www.host.com", port="5439", dbname='db_name', user='db_user', password='db_pwd')
  cur = rds_conn.cursor()
  with open("tmp/test.query", 'r') as file:
    query = file.read().replace('\n', '')

  counter = 0
  start_time = time.time()
  try:
    while True:
      cur.execute(query)
      rds_conn.commit() # first commit location
      print("sent couter: %s" % counter)
      counter += 1
  except KeyboardInterrupt:
    # rds_conn.commit() # secondary commit location
    total_time = time.time() - start_time
    queries_per_sec = counter / total_time
    print("total queries/sec: %s" % queries_per_sec)

The test.query file being loaded up is a multi-row insert file ~16.8mb that looks a little like:

insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);

(Just a lot longer)

The results of the scripts showed:

---------------------------------------------------
| process count | queries/sec | total queries/sec |
---------------------------------------------------
| 1             | 0.1786      | 0.1786            |
---------------------------------------------------
| 8             | 0.0359      | 0.2872            |
---------------------------------------------------

...which is far from the increase I'm looking for. When you can see the counter increasing across the scripts there's a clear circular pattern where each waits for the prior script's query to finish.

When the commit is moved from the first commit location to the second commit location (so commits only when the script is interrupted), only one script advances at a time. If that isn't a clear indication of some sort of transaction lock, I don't know what is.

As far as I can tell from searching, there's no document that says we can't have concurrent transactions, so what could the problem be? It's crossed my mind that the query size is so large that only one can be performed at a time, but I would have expected Redshift to have much more than ~17mb per transaction.

Upvotes: 1

Views: 2351

Answers (1)

roarster
roarster

Reputation: 4086

Inline with Guy's comment, I ended up using a COPY from an S3 bucket. This ended up being an order of magnitude faster, requiring only a single thread to call the query, and then allowing AWS to process the files from S3 in parallel. I used the guide detailed here and managed to insert about 120Gb of data in just over an hour.

Upvotes: 1

Related Questions