Reputation: 61
Amazon Redshift doesn't really have a CSV option for UNLOAD (i.e., nothing like the WITH CSV option available in PostgreSQL's COPY command). If you use the ESCAPE option, Redshift escapes double quotes with a \. Excel, for example, doesn't know what to do with that backslash. It should really be a doubled quote, per RFC-4180:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"
Sooo... quotes are going to break exports unless we can figure out a way to make Redshift escape them properly.
Given the following data:
# select * from unload_test;
test_field | test_field_number
--------------------------------------------+-------------------
"No," she said, "that's not how we do it." | 5
Hi! This is a test, yo. | 5
If we use ESCAPE but not ADDQUOTES, Excel et al. still break columns on every comma, including the "escaped" ones (\,).
If we use both ESCAPE and ADDQUOTES, we get the result you see in the initial issue example.
If we use ADDQUOTES but not ESCAPE, parsing breaks on any double quotes in the data.
Is there an available workaround that will accommodate both quotes and commas in the data?
Upvotes: 6
Views: 548
Reputation: 2796
A scalable solution was to process the contents with perl/sed and convert them to proper CSV.
These flags are required: DELIMITER ',' ADDQUOTES ESCAPE.
aws s3 cp s3://mybucket/non_csv/part_0000 - |
# Unescapes \" into "", but only when an odd number of backslashes is found looking backwards
perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' |
# Removes a lone \, but only when an odd number of backslashes is found looking backwards
perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' |
# Unescapes \\ into \
sed -e 's/\\\\/\\/g' |
aws s3 cp - s3://mybucket/csv/part_0000
I tested this on real, noisy data with various edge cases and it produces correct results. All backslashes are removed, because in proper CSV nothing needs escaping except " -> "", together with the quoting that ADDQUOTES already provides.
Example test:
$ echo '"","\"\"Hi\\\\\"","\\\\","\<tab>"' | perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | sed -e 's/\\\\/\\/g'
Output:
"","""""Hi\\""","\\","<tab>"
Perl (or another advanced regex engine) is required to do a correct look-behind.
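Python's re module also supports the look-behind, so the same transformation can be sketched as a function (my port of the pipeline above, assuming one record per line):

```python
import re

def redshift_to_csv(line: str) -> str:
    """Port of the perl/sed pipeline: unescape Redshift ESCAPE output into RFC-4180 CSV."""
    # Unescape \" into "", but only when an odd number of backslashes precedes the quote
    line = re.sub(r'((?<!\\)(\\\\)*)\\"', r'\1""', line)
    # Remove a lone \, again only when an odd number of backslashes is found backwards
    line = re.sub(r'((?<!\\)(\\\\)*)\\', r'\1', line)
    # Unescape \\ into \
    return line.replace('\\\\', '\\')

# Same input as the example test above
print(redshift_to_csv(r'"","\"\"Hi\\\\\"","\\\\","\<tab>"'))
# → "","""""Hi\\""","\\","<tab>"
```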
Performance was not an issue, because I hit network bandwidth limits before CPU limits; if needed, one can use GNU parallel to process all output files at once. Note that I did not test the aws s3 cp part, because I use another tool to get S3 contents into STDIN.
Cheers!
Upvotes: 1