Reputation: 33
We've got a data source that sends us files we need to automatically import into our system, but frequently there is bad data in them in the form of extra CRLFs inside one or two of the fields of some records. The file is a CSV format file and originally I had thought I'd fixed the problem with this:
awk 'NR%2-1{gsub(/\r?\n/, FS)} NR>1{printf RS}1' RS=\" ORS= input.csv > output.csv
which worked great for the records that had the extra CRLFs inside of quotes, but it turns out we're also getting some records that have no quotes in them which throws the awk command off.
The last field in each record is a numerical field consisting of 4-7 digits, is there some way to use that as a reference to keep the CRLF that follows that field and remove the rest?
Alternatively, is there some way to strip CRLFs by counting the fields and removing them prior to the last field in the record?
Input looks like this:
SMITH,John,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,"Northern District of California, USA",Individual,,,12/09/2003,18/08/2015,78452
SMITH,Patty,,,,,,05/10/1974,,USA,USA,A 5551212,"Resides in Oregon, USA",,,,Portland,,,,,USA,"District of Oregon, USA",Individual,,,15/09/2002,02/02/2015,121567
The problem records look like below - note the carriage return and missing quotes:
SMITH,Bill,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,Northern District of
California, USA,Individual,,,12/09/2003,18/08/2015,78452
For output we need the record to be the same as the other two - all on one line:
SMITH,Bill,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,Northern District of California, USA,Individual,,,12/09/2003,18/08/2015,78452
Upvotes: 3
Views: 319
Reputation: 439377
If all you need is the removal of the field-internal CRLFs, try the following (assumes GNU awk
, but it could be made to work with BSD awk
as well):
awk -v RS='\r?\n' '/,[[:digit:]]{4,7}$/ { print; next } { printf("%s ", $0) }' input.csv > output.csv
/,[[:digit:]]{4,7}$/
matches only lines that end in 4-7 digits, implying that the line at hand is either a complete record or is a multi-line record's last line.
{ print; next }
simply prints the line with a terminating \n
(if you wanted \r\n
on output too, you'd have to useprintf("%s\r\n", $0)
instead).{ printf("%s ", $0) }
is then only printed for record fragments, i.e., a record that has a field-internal CRLF and therefore continues on the next line; by printing it with printf
and just a trailing space, the net effect is that multiple lines comprising a single record are effectively joined with a space each on output.Upvotes: 1