singularity2047

Reputation: 1071

How to reduce the time to write pandas dataframes as table in Amazon Redshift

I am writing python pandas data frame in Amazon Redshift using this -

df.to_sql('table_name', redshiftEngine, index = False, if_exists = 'replace' )

Although my dataframes have only a couple of thousand rows and 50-100 columns, it takes 15-20 minutes to write one table. Is that normal performance for Redshift? Is there any way to optimize this process and speed up writing the table?

Upvotes: 0

Views: 2025

Answers (2)

mah65

Reputation: 588

I agree that a better approach is to upload the data as a .csv file to an S3 bucket and then load it into Redshift from there.

First, save the data as a .csv file. If you are doing that in Excel, select 'CSV UTF-8 (Comma delimited)' so special characters are handled correctly. Then run the SQL below (modify the sections accordingly). This example is for a two-column table; add more columns if your table is wider.

DROP TABLE IF EXISTS table.table;

CREATE TABLE IF NOT EXISTS table.table
(
    col1            VARCHAR(1024)   ENCODE lzo,
    col2            VARCHAR(1024)   ENCODE lzo
);


copy table.table
from 's3://address/address/address/file_name.csv'
CREDENTIALS 'aws_access_key_id=#################;aws_secret_access_key=####################'
FORMAT CSV
DELIMITER ','
IGNOREHEADER 1
region 'region_name'
--ACCEPTINVCHARS
;
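To get the .csv into S3 in the first place, something like the following sketch with boto3 would work; the bucket name and key here are hypothetical placeholders matching the path in the COPY statement above, and credentials are assumed to come from your AWS configuration.

    # Rough sketch: upload the saved CSV to S3 so the COPY above can read it.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="file_name.csv",             # local CSV saved from Excel/pandas
        Bucket="address",                      # hypothetical bucket name
        Key="address/address/file_name.csv",   # hypothetical key matching the COPY path
    )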

Upvotes: 0

rvd

Reputation: 568

A better approach is to use pandas to write your dataframe to a CSV, upload it to S3, and use the COPY command to load it into Redshift. This approach can easily handle even hundreds of millions of rows. In general, Redshift's row-by-row write performance is not great; it is designed for bulk loads dumped in by large ETL operations (like COPY).
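As a minimal sketch of that pipeline, assuming boto3 is available, that `df` and `redshiftEngine` are the objects from the question, and that the bucket name, key, and IAM role ARN are placeholders you would replace:

    import boto3
    from sqlalchemy import text

    # 1. Dump the dataframe to a local CSV (no index, header row included).
    df.to_csv("df.csv", index=False)

    # 2. Upload the CSV to S3 (placeholder bucket/key).
    boto3.client("s3").upload_file("df.csv", "my-bucket", "staging/df.csv")

    # 3. Bulk-load it with COPY, which is much faster than row-by-row INSERTs.
    copy_sql = """
        COPY table_name
        FROM 's3://my-bucket/staging/df.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT CSV
        IGNOREHEADER 1;
    """
    with redshiftEngine.begin() as conn:   # engine from the question
        conn.execute(text(copy_sql))

The target table must already exist with the matching columns; `to_sql` creates it for you, but COPY does not.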

Upvotes: 2
