Reputation: 2439
I'm facing a seemingly impossible task: extracting a huge amount of data from Amazon Redshift into another table. It clearly needs a more efficient approach, but I'm new to SQL and AWS, so I decided to ask this smart community for advice.
This is my initial SQL query, which takes forever:
-- STEP 1: CREATE A SAMPLE FOR ONE MONTH
SELECT DISTINCT at_id, utc_time, name
INTO my_new_table
FROM s3_db.table_x
WHERE type = 'create'
AND (dt BETWEEN '20181001' AND '20181031');
What would be the best approach? I was thinking of using Python and SQLAlchemy to read the data in chunks of 1M rows into dataframes and insert them into the new table (which I would need to create beforehand). Would this work?
from sqlalchemy import create_engine
import os
import pandas as pd

redshift_user = os.environ['REDSHIFT_USER']
redshift_password = os.environ['REDSHIFT_PASSWORD']

# XXXX = the cluster port
engine_string = "postgresql+psycopg2://%s:%s@%s:%d/%s" \
    % (redshift_user, redshift_password, 'localhost', XXXX, 'redshiftdb')
engine = create_engine(engine_string)

# Read the source in 1M-row chunks and append each chunk to the
# (pre-created) target table
for df in pd.read_sql_query("""
        SELECT DISTINCT at_id, utc_time, name
        FROM s3_db.table_x
        WHERE type = 'create'
        AND (dt BETWEEN '20181001' AND '20181031');
        """, engine, chunksize=1000000):
    df.to_sql('my_new_table', engine, if_exists='append', index=False)
Upvotes: 0
Views: 6869
Reputation: 331
Please refer to the AWS guidelines for Redshift and Spectrum best practices; I've put the links at the end of this post. Based on your question, I am assuming you want to extract, transform and load a huge amount of data from the Redshift Spectrum based table "s3_db.table_x" to a new Redshift table "my_new_table".
Here are some suggestions based on AWS recommendations:
Create your Redshift table with an appropriate distribution key, sort key and compression encoding. At a high level, "at_id" seems best suited as the distribution key and "utc_time" as the sort key for your requirement, but make sure to refer to the AWS guidelines for Redshift table design (table design playbook link below).
Since your data volume is huge, as you mentioned, you may want to have your S3 source table "s3_db.table_x" partitioned on the "type" and "dt" columns (as suggested at point number 4 in the Spectrum best practices link below).
Replace DISTINCT with GROUP BY in the SELECT query against Spectrum (point number 9 in the Spectrum best practices link below).
AWS recommends (point number 7 in the Spectrum best practices link below) simplifying your ETL process with CREATE TABLE AS SELECT or SELECT INTO statements, where you can put your transformation logic in the SELECT component to load data directly from S3 into Redshift; see the sketch below.
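A minimal sketch pulling these suggestions together, using only the table and column names from the question (the distribution and sort key choices follow the first point above; validate them against your own query patterns):
-- CREATE TABLE AS with explicit distribution and sort keys,
-- using GROUP BY instead of DISTINCT to deduplicate
CREATE TABLE my_new_table
DISTKEY (at_id)
SORTKEY (utc_time)
AS
SELECT at_id, utc_time, name
FROM s3_db.table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031'
GROUP BY at_id, utc_time, name;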
Links:
Redshift Spectrum best practices
Redshift table design playbook
Upvotes: 1
Reputation: 269101
It now appears that your source data is stored in Amazon S3 and you have been using a Redshift Spectrum table (that points to data in S3) as your source.
The preferred method would be:
Use the COPY command to load the data into a Redshift table.
Use a CREATE TABLE AS command to extract (ETL) the data from the new Redshift table into your desired table. If you do this on a regular basis, you can use TRUNCATE and INSERT INTO to reload the table in future.
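A minimal sketch of this two-step flow; the staging table name, S3 path, IAM role and CSV format below are placeholders and assumptions, not details from the question:
-- Step 1: COPY the raw data from S3 into a staging Redshift table
-- (assumes staging_table_x was created beforehand with a matching schema;
--  the bucket, prefix and IAM role are placeholders)
COPY staging_table_x
FROM 's3://my-bucket/path/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV;

-- Step 2: extract/transform from the staging table into the desired table
CREATE TABLE my_new_table AS
SELECT DISTINCT at_id, utc_time, name
FROM staging_table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031';

-- For regular reloads, keep the table and refresh its contents instead
TRUNCATE my_new_table;
INSERT INTO my_new_table
SELECT DISTINCT at_id, utc_time, name
FROM staging_table_x
WHERE type = 'create'
  AND dt BETWEEN '20181001' AND '20181031';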
Upvotes: 0
Reputation: 269101
You should use CREATE TABLE AS.
This allows you to specify a SELECT statement and have the results stored directly into a new table.
This is hugely more efficient than downloading data and re-uploading it.
You can also CREATE TABLE LIKE and then load it with data. See: Performing a Deep Copy
You could also UNLOAD data to Amazon S3, then load it again via COPY, but using CREATE TABLE AS is definitely the best option.
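A sketch of the deep-copy pattern described in the linked documentation, recreating a table via CREATE TABLE LIKE and reloading it; my_new_table_copy is a placeholder name:
-- Create an empty copy that inherits the column definitions,
-- distribution key and sort key of the original table
CREATE TABLE my_new_table_copy (LIKE my_new_table);

-- Reload it, then swap it in place of the original
INSERT INTO my_new_table_copy (SELECT * FROM my_new_table);
DROP TABLE my_new_table;
ALTER TABLE my_new_table_copy RENAME TO my_new_table;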
Upvotes: 3