Reputation: 178
I have a flat file generated by IBM's Change Data Delivery on z/OS. The data contains embedded newlines (\n). I found a way to replace most of them using an awk command, but there seems to be a slight bug: if a line with an odd number of quotes ends in a quote, the \n is not replaced with a space. For whatever reason I had to loop through twice to get most of them, and I am still left with one record that has the \n. Here is a sample:
"2013-11-19 10:09:09","0","I","NOT SET ","
simple string "
Needs to be essentially:
"2013-11-19 10:09:09","0","I","NOT SET ","simple string "
Here is the code I am using:
# For loop #
for a in 1 2
do
    awk -F'"' '$NF""{printf("%s ", $0);next}1' $1 > $1.filter
    rm -f $1
    mv $1.filter $1
    echo $a
done
This file has about 100k records in it. It gets picked up by DataStage, which, when it sees the \n, throws that record out because it thinks the next line is supposed to be a new record.
Thanks, Josh
EDIT:
I discovered this Perl command that will do everything, but it still ends up with the same bug as above.
$ perl -p -le 's/\n+/ /g'
Upvotes: 2
Views: 3198
Reputation: 178
I researched what glenn jackman suggested and worked out a solution in Python. Here is my code:
#!/usr/bin/python
import sys, csv, os

inputfile = sys.argv[1]
outputfile = sys.argv[1] + '.filter'
newtext = ' '
print inputfile
print outputfile
with open(inputfile, "rb") as input:
    with open(outputfile, "wb") as output:
        w = csv.writer(output, delimiter=',', quotechar='"',
                       quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n')
        for record in csv.reader(input):
            w.writerow(tuple(s.replace("\n", newtext) for s in record))
os.rename(outputfile, inputfile)
Thanks everyone for all the help. Hopefully someone having the same issue will find this. My only problem with this solution is that it adds quotes around all fields, including null fields.
Thanks, Josh
EDIT:
I was able to use perl to quickly remove all pairs of adjacent double quotes.
perl -pi -le 's/""//g' data
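As an alternative to the post-hoc perl cleanup, the csv writer can be asked for minimal quoting so empty fields are never quoted in the first place. A Python 3 sketch on a hypothetical in-memory sample (only `quoting=csv.QUOTE_MINIMAL` differs from the script above):

```python
import csv
import io

# Hypothetical sample shaped like the record from the question.
data = '"2013-11-19 10:09:09","0","I","NOT SET ","\nsimple string "\n'

src = io.StringIO(data)
dst = io.StringIO()

# QUOTE_MINIMAL quotes only fields that contain the delimiter,
# the quote character, or a newline, so null fields stay bare.
w = csv.writer(dst, delimiter=',', quotechar='"',
               quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
for record in csv.reader(src):
    w.writerow(s.replace('\n', ' ') for s in record)

print(dst.getvalue())
```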
Upvotes: 2
Reputation: 247042
I would use a language with a CSV parser. Try to parse the current line; if there's an error, join the next line and try again. For example, with Ruby:
ruby -rcsv -ne '
  chomp
  loop do
    begin
      row = CSV.parse_line($_)
      # if no error thrown, we have a parseable line
      puts row.inspect
      break
    rescue
      # grab the next line and try again
      $_ += gets
    end
  end
' << END
a,b,c,d,e
1,2,3,4,5
"2013-11-19 10:09:09","0","I","NOT SET ","
simple string "
"a 1","b 2","c 3","d 4","e 5"
END
["a", "b", "c", "d", "e"]
["1", "2", "3", "4", "5"]
["2013-11-19 10:09:09", "0", "I", "NOT SET ", "simple string "]
["a 1", "b 2", "c 3", "d 4", "e 5"]
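For comparison, the same try-again idea can be sketched in Python. Python's csv module does not raise on an unterminated quote the way Ruby's CSV.parse_line does, so this hypothetical sketch uses quote parity as the "record is incomplete" test instead of an exception:

```python
import csv
import io

def logical_records(lines):
    """Accumulate physical lines until the buffer contains an even
    number of double quotes, i.e. every quoted field is closed."""
    buf = ''
    for line in lines:
        buf += line.rstrip('\n')
        if buf.count('"') % 2 == 0:
            yield buf
            buf = ''
        else:
            buf += ' '  # the embedded newline becomes a space
    if buf:
        yield buf

sample = [
    'a,b,c,d,e\n',
    '"2013-11-19 10:09:09","0","I","NOT SET ","\n',
    'simple string "\n',
]
for rec in logical_records(sample):
    print(next(csv.reader(io.StringIO(rec))))
```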
Upvotes: 1
Reputation: 10039
sed -n -e '/"/ {
s/:/:d/g;s/\\"/:e/g
:b
/^\(\("[^"]*"\)*[^"]*\)*"\([^"]*\)$/ {
N
s/\
//
b b
}
s/:e/\\"/g;s/:d/:/g
}
p' YourFile
This uses a translation of \" beforehand. It costs a bit more CPU but passes through escaped quotes correctly.
Upvotes: 0
Reputation: 123570
Here's one way of doing it:
sed -n -e ':b; /^[^"]*"[^"]*\("[^"]*"[^"]*\)*$/ { N; s/\
//; bb; }; p; '
In pseudocode it goes:
label b:
    if the line has an odd number of quotes:
        read and append the next line
        remove the line feed
        goto b
print line
Example output:
$ cat file
"2013-11-19 10:09:09","0","I","NOT SET ","
simple string "
"normal data",42
"some other
string"
$ sed -n -e ':b; /^[^"]*"[^"]*\("[^"]*"[^"]*\)*$/ { N; s/\
//; bb; }; p; ' < file
"2013-11-19 10:09:09","0","I","NOT SET ","simple string "
"normal data",42
"some otherstring"
$
Note that quotes escaped with backslash ("foo\"bar") will ruin it, while quotes escaped by doubling ("foo""bar") will work. Make sure you know which CSV dialect you're using.
Upvotes: 4