Reputation: 1416
A ~500 gig table was unloaded to S3 using the default delimiter of '|'. The command looked something like this...
UNLOAD ('select * from some_table')
TO 's3://some-path/unloaded-tables/some-table/'
CREDENTIALS 'aws_access_key_id=xyz; aws_secret_access_key=abc'
MANIFEST;
I am attempting to reload this data back into some_table with a command like the following:
COPY some_table
FROM 's3://some-path/unloaded-tables/some-table/manifest'
CREDENTIALS 'aws_access_key_id=xyz; aws_secret_access_key=abc'
MANIFEST
DELIMITER '|';
Tests were run on a subset of the data to make sure it was reloadable. However, it turns out one of the columns in the dataset can contain a pipe (in about 20% of records, it seems). No such data was in the testing set.
Now when trying to load the data, the records containing the pipe fail to be parsed correctly.
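For example (values hypothetical), a row whose third column contains 'bar|baz' gets unloaded as:
1|foo|bar|baz
and COPY then sees four fields where the table has three.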
I would find it surprising that the UNLOAD command does not escape delimiters found in records as it is unloading, but perhaps that is naive.
Any ideas on how I might be able to address this problem, short of some heroic program that downloads each file from S3 and attempts to fix the issue by hand? I am praying there is a magic COPY command parameter that will help here.
Upvotes: 0
Views: 681
Reputation: 5850
You have to tell Redshift to explicitly escape the delimiter in both your unload and copy commands, and also wrap all fields in quotes.
UNLOAD ('statement')
TO 's3://bucket'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
GZIP
DELIMITER as '|'
ESCAPE ADDQUOTES ALLOWOVERWRITE;
Then copy it back:
COPY table
FROM 's3path'
CREDENTIALS 'credentials'
GZIP REMOVEQUOTES ESCAPE DELIMITER '|'
FILLRECORD EMPTYASNULL BLANKSASNULL TIMEFORMAT 'auto'
COMPUPDATE off STATUPDATE off;
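A sanity check on a small sample before redoing the full ~500 GB might look like this (the scratch table and test prefix are hypothetical; note that UNLOAD only accepts LIMIT inside a nested subquery):
-- Unload a small sample with escaping and quoting enabled
UNLOAD ('select * from (select * from some_table limit 1000)')
TO 's3://some-path/unload-test/some-table/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
MANIFEST GZIP DELIMITER '|' ESCAPE ADDQUOTES ALLOWOVERWRITE;
-- Load it into a scratch copy of the table
CREATE TABLE some_table_test (LIKE some_table);
COPY some_table_test
FROM 's3://some-path/unload-test/some-table/manifest'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
MANIFEST GZIP REMOVEQUOTES ESCAPE DELIMITER '|';
-- Should return 1000, with pipe-containing values intact
SELECT count(*) FROM some_table_test;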
Upvotes: 3
Reputation: 1066
Run the unload again, this time adding quotes:
UNLOAD ('select * from some_table')
TO 's3://some-path/unloaded-tables/some-table/'
CREDENTIALS 'aws_access_key_id=xyz; aws_secret_access_key=abc'
DELIMITER '|' ADDQUOTES MANIFEST ALLOWOVERWRITE;
Your copy command will look like:
COPY some_table
FROM 's3://some-path/unloaded-tables/some-table/manifest'
CREDENTIALS 'aws_access_key_id=xyz; aws_secret_access_key=abc'
MANIFEST
REMOVEQUOTES
DELIMITER '|';
Your unloaded data will have quotes around each column value, so a pipe inside quotes will not be treated as a delimiter.
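For illustration (values hypothetical), a row whose last column contains a pipe is unloaded as:
"1"|"foo"|"bar|baz"
REMOVEQUOTES then strips the surrounding quotes on the way back in and keeps everything between them, including the embedded pipe.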
Upvotes: 1
Reputation: 1232
If you are okay with loading this data into a staging table first and then splitting it out into the real table, this is what you can do:
Example: say your data has 3 columns (so 2 pipes per row), but one of the columns can itself contain a pipe, and you unloaded it to S3 with the '|' delimiter.
Create a table (T1) with a single VARCHAR(MAX) column.
Copy your unloaded data into this table with a delimiter that you are sure won't be present in your data, like \t or \001 (^A), so each whole line lands in that single column (see the sketch after these steps).
Create a new table (T2) that has the required number of columns and data types.
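A minimal sketch of the staging table and copy, reusing the names from the question (the tab delimiter is only a guess at a character absent from your data):
create table T1 (X varchar(max));
copy T1
from 's3://some-path/unloaded-tables/some-table/manifest'
credentials 'aws_access_key_id=xyz; aws_secret_access_key=abc'
manifest
delimiter '\t';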
For rows that have no extra pipes beyond the delimiters, insert straight into your new table. The query should be something like this:
insert into T2
select split_part(X, '|', 1),
split_part(X, '|', 2),
split_part(X, '|', 3)
from T1
where len(X) - len(replace(X, '|', '')) = 2;
For rows that have one extra pipe (assumed here to sit inside the last column), stitch the splits back together, reinserting the pipe, and insert into T2:
insert into T2
select split_part(X, '|', 1),
split_part(X, '|', 2),
split_part(X, '|', 3) || '|' || split_part(X, '|', 4)
from T1
where len(X) - len(replace(X, '|', '')) = 3;
Note:
len(X) - len(replace(X, '|', ''))
counts the pipes in a row, and
||
is the string concatenation operator.
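For example, on a hypothetical row value:
-- 'a|b|c|d' has 7 characters and 4 without pipes, so this returns 3
select len('a|b|c|d') - len(replace('a|b|c|d', '|', ''));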
Let me know if you have any questions.
Upvotes: 2