Ryan

Reputation: 305

Python script hangs when executing long-running query, even after the query completes

I've got a Python script that loops through folders and, within each folder, executes each SQL file against our Redshift cluster (using psycopg2). Here is the code that does the loop (note: this works just fine for queries that take only a few minutes to execute):

for folder in dir_list: 
    #Each query is stored in a folder by group, so we have to go through each folder and then each file in that folder
    file_list = os.listdir(source_dir_wkly + "\\" + str(folder))

    for f in file_list:
        src_filename = source_dir_wkly + "\\" + str(folder) + "\\" + str(f)
        dest_filename = dest_dir_wkly + "\\" + os.path.splitext(os.path.basename(src_filename))[0] + ".csv"
        result = dal.execute_query(src_filename)
        result.to_csv(path_or_buf=dest_filename,index=False)

execute_query is a method stored in another file:

def execute_query(self, source_path):
    conn_rs = psycopg2.connect(self.conn_string)
    cursor = conn_rs.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    sql_file = self.read_sql_file(source_path)
    cursor.execute(sql_file)
    records = cursor.fetchall()
    conn_rs.commit()
    return pd.DataFrame(data=records)

def read_sql_file(self, path):
    with open(path, 'r') as f:
        return f.read()

I have a couple of queries that take around 15 minutes to execute (not unusual given the size of the data in our Redshift cluster), and they run just fine in SQL Workbench. I can see in the AWS Console that the query has completed, but the script just hangs: it never dumps the results to a CSV file, nor does it move on to the next file in the folder.

I don't have any timeouts specified. Is there anything else I'm missing?

Upvotes: 0

Views: 1389

Answers (1)

danny

Reputation: 5270

The line records = cursor.fetchall() is likely the culprit: it loads the entire result set of the query into memory at once. Given how large your queries are, that data probably cannot all fit in memory.

You should iterate over the results from the cursor and write them to your CSV one row at a time. In general, trying to read all the data from a database query at once is not a good idea.

You will need to refactor your code to do so, for example with the standard csv module:

writer = csv.DictWriter(csv_fh, fieldnames=[col[0] for col in cursor.description])
writer.writeheader()
for record in cursor:
    writer.writerow(record)

Here csv_fh is a file handle to your CSV file, opened for writing. Your use of pd.DataFrame will need rewriting, as it expects all of the data to be passed to it at once.
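
Putting that together with your original execute_query, a minimal sketch of a streaming version might look like this. The method name execute_query_to_csv, the cursor name, and the batch size are just illustrative; the key point is that a psycopg2 named (server-side) cursor fetches rows in batches of itersize as you iterate, instead of pulling the whole result set into memory the way fetchall() does:

import csv
import psycopg2
import psycopg2.extras

def execute_query_to_csv(self, source_path, dest_filename):
    conn_rs = psycopg2.connect(self.conn_string)
    # A named cursor is server-side: iterating it fetches rows in batches
    # of itersize rather than loading the whole result set at once.
    cursor = conn_rs.cursor(name='weekly_export',
                            cursor_factory=psycopg2.extras.RealDictCursor)
    cursor.itersize = 10000  # illustrative batch size
    cursor.execute(self.read_sql_file(source_path))

    with open(dest_filename, 'w', newline='') as csv_fh:
        writer = None
        for record in cursor:
            if writer is None:
                # Build the header lazily from the first row's column names.
                writer = csv.DictWriter(csv_fh, fieldnames=list(record.keys()))
                writer.writeheader()
            writer.writerow(record)

    cursor.close()
    conn_rs.close()

If the calling loop only needs the CSV on disk, a method like this removes the need to build a DataFrame at all; if you do still need a DataFrame elsewhere, pandas.read_sql with a chunksize argument is another option worth looking at.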

Upvotes: 1
