rshepherd

Reputation: 1416

Reloading unloaded table data from S3

A ~500 GB table was unloaded to S3 using the default delimiter of '|'. The command looked something like this:

UNLOAD ('select * from some_table')
TO 's3://some-path/unloaded-tables/some-table/'
CREDENTIALS 'aws_access_key_id=xyz;  aws_secret_access_key=abc'
MANIFEST;

I am attempting to reload this data back into some_table with a command like the following:

COPY some_table 
FROM 's3://some-path/unloaded-tables/some-table/manifest' 
CREDENTIALS 'aws_access_key_id=xyz;  aws_secret_access_key=abc'
MANIFEST
DELIMITER '|';

Tests were run on a subset of the data to make sure it was reloadable. However, it turns out one of the columns in the dataset can contain a pipe (in about 20% of records, it seems). No such data was in the test subset.

Now when trying to load the data, the records containing the pipe fail to be parsed correctly.

I find it surprising that UNLOAD does not escape delimiters it finds in the data as it unloads, but perhaps that is naive.

Any ideas on how I might address this problem, short of some heroic program that downloads each file from S3 and attempts to fix the issue by hand? I am praying there is a magic COPY command parameter that will help here.

Upvotes: 0

Views: 681

Answers (3)

mastaBlasta

Reputation: 5850

You have to tell Redshift to explicitly escape the delimiter in both your unload and copy commands, and also wrap all fields in quotes.

UNLOAD ('statement')
TO 's3://bucket'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
GZIP 
DELIMITER as '|'
ESCAPE ADDQUOTES ALLOWOVERWRITE;

Then copy it back:

COPY table
FROM 's3path'
CREDENTIALS 'credentials'
GZIP REMOVEQUOTES ESCAPE DELIMITER '|'
FILLRECORD EMPTYASNULL BLANKSASNULL TIMEFORMAT 'auto'
COMPUPDATE off STATUPDATE off;
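As a rough sketch of what ESCAPE does (an illustration in Python, not Redshift itself): on unload, each delimiter, quote, newline, or backslash inside a field gets a backslash prefix, so the copy side can tell real delimiters from data.

```python
def escape_field(value: str, delim: str = "|") -> str:
    """Backslash-escape special characters, roughly as UNLOAD ... ESCAPE would."""
    out = []
    for ch in value:
        if ch in (delim, "\\", '"', "\n"):
            out.append("\\")
        out.append(ch)
    return "".join(out)

def split_escaped(line: str, delim: str = "|") -> list[str]:
    """Split a line on unescaped delimiters only, mimicking COPY ... ESCAPE."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            current.append(line[i + 1])   # take the escaped character literally
            i += 2
        elif ch == delim:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields

row = ["id-1", "value|with|pipes", "plain"]
line = "|".join(escape_field(f) for f in row)
# The escaped pipes survive the round trip intact:
assert split_escaped(line) == row
```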

Upvotes: 3

vtuhtan

Reputation: 1066

Run another unload, this time adding ADDQUOTES:

UNLOAD ('select * from some_table')
TO 's3://some-path/unloaded-tables/some-table/'
CREDENTIALS 'aws_access_key_id=xyz;  aws_secret_access_key=abc'
MANIFEST
DELIMITER '|' ADDQUOTES;

Your copy command will look like:

COPY some_table 
FROM 's3://some-path/unloaded-tables/some-table/manifest' 
CREDENTIALS 'aws_access_key_id=xyz;  aws_secret_access_key=abc'
MANIFEST
REMOVEQUOTES
DELIMITER '|';

Your unloaded data will have quotes around each column value, so a pipe inside quotes will not be treated as a delimiter.
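The quoting behavior can be illustrated with Python's csv module (an analogy, not Redshift itself): writing with QUOTE_ALL mimics ADDQUOTES, and reading the quoted file back mimics REMOVEQUOTES.

```python
import csv
import io

rows = [["1", "contains|a|pipe", "ok"],
        ["2", "no pipe here", "ok"]]

# Write pipe-delimited data with every field quoted (like ADDQUOTES).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_ALL)
writer.writerows(rows)

# Read it back: pipes inside quotes are data, not delimiters (like REMOVEQUOTES).
buf.seek(0)
reader = csv.reader(buf, delimiter="|")
assert list(reader) == rows
```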

Upvotes: 1

rohitkulky

Reputation: 1232

If you are okay with loading this data into a staging table first and then splitting it out from there, this is what you can do -

Example - Say your table has 3 columns, so a clean row contains 2 pipes, but some rows have an extra pipe inside one of the columns. You unloaded it to S3 with "|" as the delimiter.

  1. Create a table (T1) with a single column of type varchar(max)

  2. Copy your unloaded data into this table with a delimiter that you are sure won't be present in your data - like \t or \001 (^A)

    This is what the data will look like in the table: each original row stored as one long pipe-delimited string (screenshot omitted)

  3. Create a new table (T2) that has the required number of columns and data types.

  4. For rows which have no extra pipes apart from the delimiters, insert into your new table.

    The query should be something like this -

    insert into T2 
    select split_part(X,'|',1),
           split_part(X,'|',2),
           split_part(X,'|',3)
    from T1
    where len(X) - len(replace(X,'|','')) = 2;
    
  5. For rows which have pipes other than the delimiters, combine the split parts back into one value (restoring the pipe, since it is real data) and insert into T2.

    insert into T2 
    select split_part(X,'|',1),
           split_part(X,'|',2),
           split_part(X,'|',3) || '|' || split_part(X,'|',4)
    from T1
    where len(X) - len(replace(X,'|','')) = 3;
    

Note:

len(X) - len(replace(X,'|','')) counts the number of pipes in the row; a clean 3-column row contains 2.

|| is the string concatenation operator.
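The two expressions the queries rely on can be sketched in Python (an illustration of the logic, not the Redshift implementation):

```python
def split_part(s: str, delim: str, n: int) -> str:
    """1-indexed like Redshift's split_part; empty string when n is out of range."""
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

def pipe_count(s: str) -> int:
    """Equivalent of len(X) - len(replace(X, '|', ''))."""
    return len(s) - len(s.replace("|", ""))

clean = "a|b|c"      # 3 columns, 2 pipes: delimiters only
dirty = "a|b|c|d"    # 3 pipes: the last column actually contained "c|d"
assert pipe_count(clean) == 2
assert pipe_count(dirty) == 3
# The repair step: glue parts 3 and 4 back together around the data pipe.
assert split_part(dirty, "|", 3) + "|" + split_part(dirty, "|", 4) == "c|d"
```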

Let me know if you have any questions.

Upvotes: 2
