Reputation: 668
I have a csv line like this:
"aaa"|"bbb"|"ccc"|"dddd
eeeee"
I want to import this data using pgloader (http://pgloader.io/) or PostgreSQL COPY (https://www.postgresql.org/docs/9.6/static/sql-copy.html). My issue here is the fact tha according to the CSV standards it's possible to have a line break (\n) inside a quoted field value. But pgloader and also the COPY command treat it like a brand new line of data, instead of one column with line break inside.
COPY schema.table (
col_aaa,
col_bbb,
col_ccc,
col_ddd
)
FROM 'file.csv' WITH DELIMITER '|' ENCODING 'LATIN1' CSV;
My COPY command My pgloader command
LOAD CSV
FROM 'file.csv'
INTO postgresql://user:password@host:5432/database?schema.table (col_aaa, col_bbb, col_ccc, col_ddd)
WITH
skip header = 0,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by '|'
SET client_encoding to 'latin1'
BEFORE LOAD DO
$$ TRUNCATE anac.aerodromos_csv RESTART IDENTITY; $$;
I've search a lot from the PostgreSQL docs and also from google.
The only thing that I've found was this: Parsing CSV file with \n in double quoted fields but awk is too slow for a file with more than 1 million lines.
Any hint about how to do it?
My preference is for pgloader, but I can accept the use of sed or perl as regex agent to treat file at a linux shell script.
Any clues about how to do it?
Upvotes: 1
Views: 2167
Reputation: 668
The problem was that at the end of line there was an \r\n. And to solve that I removed the \r only from the line break inside the data.
That way pgloader was able to do the job.
I've done that by using an one line perl.
perl -0777 -pi -e 's/(?<="[^"|])*(?<!["|])\r\n(?=[^"]*")/\n/smg' $csv_file_name # O -0777 is explained at em https://stackoverflow.com/questions/9670426/perl-command-line-multi-line-replace
Perl command line multi-line replace
Upvotes: 0
Reputation: 1113
To give you an idea to solve this problem, I am writing this example.
I have just assumed that file will contain only 4 column and will contain only 1 line break. If It isn't the case then you need to change this.
Input File:
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"dddd
eeeee"
"aaa"|"bbb"|"cc
c"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"b
bb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"a
aa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"dddd
eeeee"
script.pl
#!/usr/bin/perl
use strict; use warnings; use Data::Dumper;
open ( my $RFH, '<', 'input.io' ) or die ($!);
open ( my $WFH, '>', 'output.o' ) or die ($!);
my $line_break = 0;
my $old_line = '';
while ( my $line = <$RFH> ) {
chomp($line);
if ( ! $line_break ){
my @columns = split( /\|/, $line );
if ( scalar( @columns ) == 4 && $columns[3] =~ m/"$/ ){
print $WFH $line."\n";
}
else{
$line_break = 1;
$old_line = $line;
next;
}
}
else{
$line = $old_line . $line;
$old_line = '';
$line_break = 0;
print $WFH $line."\n";
}
}
close($RFH);
close($WFH);
OUTPUT File:
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
Change this example to your need. Hope this helps.
Upvotes: -1