Reputation: 23
So, I am trying to load data into MySQL using LOAD DATA LOCAL INFILE. About 2,400,000 rows.
Some of these rows have lines with a line breaks inside the data and are not being inserted.
One thing we are sure about is that the last character before the actual linebreak is a "
(double quote) so we can change all linebreaks next to a "
to be something else like "***\r\n
.
This will enable me to add the statement LINES TERMINATED BY '***\r\n'
instead of just \r\n
E.g
One entry in my input file: in.csv (Line terminated by \r\n)
1,223,"{...}","Some title with
line breaks"\r\n
Wanted output after processing file:
1,223,"{...}","Some title with
line breaks"***\r\n
`
An alternative would be to remove all line breaks within the CSV data and only have line breaks at the end of the row. Problem is I'm not sure how to do that fast enough since I am dealing with really large files (2GB+)
Now, I have tried sed unsuccessfully. I think I am missing something. Here is my closest attempt.
sed ':a;N;$!ba;s/"$/***"\r\n/g' in.csv > out.csv
However, this did not work for me.
Thanks.
***UPDATE***
I realized that all records were being inserted by load data query. I thought otherwise because the line-count by using wc-l
was different from select count(*)
.
Then I realized that when wc-l encounters data with embedded linebreaks, it regards it as a separate row instead of regarding it as part of the same row.
I thank you all for your efforts.
Upvotes: 2
Views: 100
Reputation: 627100
You need to bear in mind that $
only matches before a newline, LF, char. You have a carriage return before a newline.
So, you need to make sure you match "
either before a newline or before a CRLF:
sed -E ':a;N;$!ba;s/"\r?$/***"\r\n/g' in.csv > out.csv
Here, the POSIX ERE "\r?$
pattern matches a "
char, then an optional carriage return, and then asserts the position at the end of the string.
Test of a UTF8 encoded text file with CRLF endings
yields
Upvotes: 1
Reputation: 522181
Here is a PHP script which should meet your needs:
$fin = fopen("input.txt", "r");
$fout = fopen("output.txt", "w");
while(!feof($fin)) {
$line = preg_replace("/(?<!\")\r?\n/", "", fgets($fin));
fwrite($fout, $line);
}
fclose($fin);
fclose($fout);
This solution uses the regex pattern (?<!\")\r?\n
to target CR?LF only if it does not immediately follow a double quote.
Upvotes: 1