Brian D.

Reputation: 61

What's the best way to create RFC-4180-friendly CSV files from Amazon Redshift UNLOAD?

Amazon Redshift doesn't really have a CSV option when using UNLOAD (e.g., similar to the WITH CSV available in PostgreSQL's COPY command).

If you use the ESCAPE option, Redshift escapes double quotes with a \. Excel, for example, doesn't know what to do with that backslash. It should really be a doubled quote, per RFC-4180:

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

"aaa","b""bb","ccc"

So quotes are going to break exports unless we can find a way to make Redshift escape them properly.

Given the following data:

# select * from unload_test;
                 test_field                 | test_field_number
 --------------------------------------------+-------------------
 "No," she said, "that's not how we do it." |                 5
 Hi! This is a test, yo.                    |                 5

Is there an available work-around for this that will accommodate both quotes and commas in the data?

Upvotes: 6

Views: 548

Answers (1)

TautrimasPajarskas

Reputation: 2796

A scalable solution is to post-process the UNLOAD output with perl/sed and convert it to proper CSV.

These UNLOAD options are required: DELIMITER ',' ADDQUOTES ESCAPE.
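For context, the UNLOAD statement that produces the files consumed below would look roughly like this. This is my own sketch, not from the original answer: the table name comes from the question, the bucket path matches the pipeline below, and the IAM role ARN is a placeholder you would replace with your own.

```sql
-- Hypothetical sketch of the UNLOAD producing the non_csv slices below.
-- Table name from the question; bucket and IAM role are placeholders.
UNLOAD ('SELECT * FROM unload_test')
TO 's3://mybucket/non_csv/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
DELIMITER ',' ADDQUOTES ESCAPE;
```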

aws s3 cp s3://mybucket/non_csv/part_0000 - | \
# Turn \" into "", but only when the quote sits at the end of an odd-length backslash run
perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | \
# Remove a lone \, i.e. the trailing backslash of an odd-length backslash run
perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | \
# Unescape \\ into \
sed -e 's/\\\\/\\/g' | \
aws s3 cp - s3://mybucket/csv/part_0000

I tested this on various edge cases against real, noisy data and it produces correct results. All escaping backslashes are removed, because RFC-4180 needs no escaping other than doubling " to "" inside quoted fields.

Example test:

$ echo '"","\"\"Hi\\\\\"","\\\\","\<tab>"' | perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | sed -e 's/\\\\/\\/g'

Output:

"","""""Hi\\""","\\","<tab>"

Perl, or another regexp engine with look-behind support, is required to do the look-behinds correctly.

Performance was not an issue, because I hit network bandwidth limits before CPU limits. However, one can use GNU parallel to process all output files at once.

I did not test the aws s3 cp parts, though, because I use another tool to get S3 contents into STDIN.

Cheers!

Upvotes: 1
