Need a way to strip extra CRLFs from the middle of records

Question

We've got a data source that sends us files we need to automatically import into our system, but frequently there is bad data in them in the form of extra CRLFs inside one or two of the fields of some records. The file is a CSV format file and originally I had thought I'd fixed the problem with this:

awk 'NR%2-1{gsub(/
?
/, FS)} NR>1{printf RS}1' RS=" ORS= input.csv > output.csv

which worked great for the records that had the extra CRLFs inside of quotes, but it turns out we're also getting some records that have no quotes in them which throws the awk command off.

The last field in each record is a numerical field consisting of 4-7 digits, is there some way to use that as a reference to keep the CRLF that follows that field and remove the rest?

Alternatively, is there some way to strip CRLFs by counting the fields and removing them prior to the last field in the record?

Input looks like this:

SMITH,John,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,"Northern District of California, USA",Individual,,,12/09/2003,18/08/2015,78452
SMITH,Patty,,,,,,05/10/1974,,USA,USA,A 5551212,"Resides in Oregon, USA",,,,Portland,,,,,USA,"District of Oregon, USA",Individual,,,15/09/2002,02/02/2015,121567

The problem records look like below - note the carriage return and missing quotes:

SMITH,Bill,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,Northern District of
California, USA,Individual,,,12/09/2003,18/08/2015,78452

For output we need the record to be the same as the other two - all on one line:

SMITH,Bill,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,Northern District of California, USA,Individual,,,12/09/2003,18/08/2015,78452

mklement0 · Accepted Answer

If all you need is the removal of the field-internal CRLFs, try the following (assumes GNU awk, but it could be made to work with BSD awk as well):

awk -v RS='
?
' '/,[[:digit:]]{4,7}$/ { print; next } { printf("%s ", $0) }' input.csv > output.csv

/,[[:digit:]]{4,7}$/ matches only lines that end in 4-7 digits, implying that the line at hand is either a complete record or is a multi-line record's last line.
- { print; next } simply prints the line with a terminating (if you wanted on output too, you'd have to useprintf("%s ", $0) instead).
{ printf("%s ", $0) } is then only printed for record fragments, i.e., a record that has a field-internal CRLF and therefore continues on the next line; by printing it with printf and just a trailing space, the net effect is that multiple lines comprising a single record are effectively joined with a space each on output.

Need a way to strip extra CRLFs from the middle of records

Answers (1)

Related Questions