aviss

Reputation: 2439

Loading data from AWS Redshift using Python

I'm facing what feels like a mission impossible: extracting a huge amount of data from Amazon Redshift into another table. It definitely requires a more efficient approach, but I'm new to SQL and AWS, so I decided to ask this smart community for advice.

This is my initial SQL query which takes forever:

-- STEP 1: CREATE A SAMPLE FOR ONE MONTH
SELECT DISTINCT at_id, utc_time, name
INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create' 
AND (dt BETWEEN '20181001' AND '20181031');

What would be the best approach? I was thinking of using Python and SQLAlchemy to read the data into dataframes in chunks of 1m rows and insert them back into the new table (which I need to create beforehand). Would this work?:

from sqlalchemy import create_engine
import os
import pandas as pd

redshift_user = os.environ['REDSHIFT_USER']
redshift_password = os.environ['REDSHIFT_PASSWORD']

engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user, redshift_password, 'localhost', XXXX, 'redshiftdb')
engine = create_engine(engine_string)

for df in pd.read_sql_query("""
                        SELECT DISTINCT at_id, utc_time, name
                        FROM s3_db.table_x
                        WHERE type = 'create'
                        AND (dt BETWEEN '20181001' AND '20181031');
                        """, engine, chunksize=1000000):
    # append each 1m-row chunk to the new table created beforehand
    df.to_sql('my_new_table', engine, if_exists='append', index=False)

Upvotes: 0

Views: 6869

Answers (3)

Manash Deb

Reputation: 331

Please refer to the AWS guidelines for RedShift and Spectrum best practices; I've put the links at the end of this post. Based on your question, I am assuming you want to extract, transform and load a huge amount of data from the RedShift Spectrum based table "s3_db.table_x" into the new RedShift table "my_new_table".

Here are some suggestions based on AWS recommendations:

  1. Create your RedShift table with an appropriate distribution key, sort key and compression encoding. At a high level, "at_id" seems best suited as the distribution key and "utc_time" as the sort key for your requirement, but make sure to refer to the AWS guidelines for RedShift table design [3].

  2. As you mentioned, your data volume is huge, so you may like to have your S3 source table "s3_db.table_x" partitioned on the "type" and "dt" columns (as suggested at point number 4 in the Spectrum best practices [1]).

  3. Replace DISTINCT with GROUP BY in the SELECT query from Spectrum (point number 9 in the Spectrum best practices [1]).

  4. AWS recommends (point number 7 in the Spectrum best practices [1]) simplifying your ETL process with CREATE TABLE AS SELECT or SELECT INTO statements, wherein you may put your transformation logic in the SELECT component to load data directly from S3 into RedShift; a sketch combining these suggestions follows below.
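
A minimal sketch of how suggestions 1, 3 and 4 could fit together; the DISTKEY/SORTKEY choices here are only illustrative and should come from your own table design review:

-- Create the target table with distribution and sort keys, loading it
-- directly from the Spectrum table and using GROUP BY instead of DISTINCT.
CREATE TABLE my_new_table
DISTKEY (at_id)
SORTKEY (utc_time)
AS
SELECT at_id, utc_time, name
FROM s3_db.table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031'
GROUP BY at_id, utc_time, name;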

  1. redshift spectrum best practices
  2. redshift best practices
  3. redshift table design playbook

Upvotes: 1

John Rotenstein

Reputation: 269101

It now appears that your source data is stored in Amazon S3 and you have been using a Redshift Spectrum table (that points to data in S3) as your source.

The preferred method would be:

  • Use the Amazon Redshift COPY command to load the data into a Redshift table
  • Use a CREATE TABLE AS command to extract (ETL) the data from the new Redshift table into your desired table. If you do this on a regular basis, you can use TRUNCATE and INSERT INTO to reload the table in the future. A sketch of this flow follows below.
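
A rough sketch of that two-step flow; the S3 path, IAM role, file format and staging table name are placeholders, not details from the question:

-- Step 1: load the raw S3 files into a staging table in Redshift.
COPY staging_table_x
FROM 's3://your-bucket/path/to/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
FORMAT AS PARQUET;

-- Step 2: extract/transform from the staging table into the target table.
CREATE TABLE my_new_table AS
SELECT DISTINCT at_id, utc_time, name
FROM staging_table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031';

-- For recurring loads, keep the table and reload it instead:
TRUNCATE my_new_table;
INSERT INTO my_new_table
SELECT DISTINCT at_id, utc_time, name
FROM staging_table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031';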

Upvotes: 0

John Rotenstein

Reputation: 269101

You should use CREATE TABLE AS.

This allows you to specify a SELECT statement and have the results directly stored into a new table.

This is hugely more efficient than downloading data and re-uploading.

You can also use CREATE TABLE LIKE and then load it with data, as sketched below. See: Performing a Deep Copy
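
A minimal sketch of that deep-copy pattern, with illustrative table names:

-- Create an empty table that inherits the original's structure, keys and
-- encodings, reload it, then swap the names.
CREATE TABLE my_new_table_copy (LIKE my_new_table);

INSERT INTO my_new_table_copy
SELECT * FROM my_new_table;

DROP TABLE my_new_table;
ALTER TABLE my_new_table_copy RENAME TO my_new_table;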

You could also UNLOAD data to Amazon S3, then load it again via COPY, but using CREATE TABLE AS is definitely the best option.
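
And a sketch of the UNLOAD/COPY round trip, again with a placeholder bucket and IAM role:

-- Write the query result out to S3 (pipe-delimited text by default) ...
UNLOAD ('SELECT DISTINCT at_id, utc_time, name
         FROM s3_db.table_x
         WHERE type = ''create''
         AND dt BETWEEN ''20181001'' AND ''20181031''')
TO 's3://your-bucket/exports/table_x_'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole';

-- ... then load the files back into the target table.
COPY my_new_table
FROM 's3://your-bucket/exports/table_x_'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole';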

Upvotes: 3
