Reputation: 178
I have a flat file generated by IBM's Change Data Delivery on z/OS. The data contains embedded newlines (\n). I found a way to replace most of them using an awk command, but there seems to be a slight bug: if a line with an odd number of quotes ends in a quote, the \n is not replaced with a space. For whatever reason I had to loop through twice to get most of them, and I am still left with one record that has the \n. Here is a sample:
"2013-11-19 10:09:09","0","I","NOT SET ","
simple string "
Needs to be essentially:
"2013-11-19 10:09:09","0","I","NOT SET ","simple string "
Here is the code I am using:
# For loop #
for a in 1 2
do
    awk -F'"' '$NF""{printf("%s ", $0);next}1' $1 > $1.filter
    rm -f $1
    mv $1.filter $1
    echo $a
done
This file has about 100k records in it. It gets picked up by DataStage, which, when it sees the \n, throws that record out because it thinks the next line is supposed to be a new record.
Thanks, Josh
EDIT:
I discovered this Perl command that will do everything, but it still ends up with the same bug as above.
$ perl -p -le 's/\n+/ /g'
Upvotes: 2
Views: 3198
Reputation: 178
I researched what glenn jackman suggested and worked out a solution in Python. Here is my code:
#!/usr/bin/python
import sys, csv, os

inputfile = sys.argv[1]
outputfile = sys.argv[1] + '.filter'
newtext = ' '
print inputfile
print outputfile
with open(inputfile, "rb") as input:
    with open(outputfile, "wb") as output:
        w = csv.writer(output, delimiter=',', quotechar='"',
                       quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n')
        for record in csv.reader(input):
            w.writerow(tuple(s.replace("\n", newtext) for s in record))
os.rename(outputfile, inputfile)
Thanks everyone for all the help. Hopefully someone having the same issue will find this. My only problem with this solution is that it adds quotes around all fields, including null fields.
Thanks, Josh
EDIT:
I was able to use perl to quickly remove all pairs of adjacent double quotes.
perl -pi -le 's/""//g' data
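As an alternative to the post-hoc perl cleanup, the csv writer can be asked for minimal quoting so empty fields are never quoted in the first place. A Python 3 sketch on a hypothetical in-memory sample (only `quoting=csv.QUOTE_MINIMAL` differs from the script above):

```python
import csv
import io

# Hypothetical sample shaped like the record from the question.
data = '"2013-11-19 10:09:09","0","I","NOT SET ","\nsimple string "\n'

src = io.StringIO(data)
dst = io.StringIO()

# QUOTE_MINIMAL quotes only fields that contain the delimiter,
# the quote character, or a newline, so null fields stay bare.
w = csv.writer(dst, delimiter=',', quotechar='"',
               quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
for record in csv.reader(src):
    w.writerow(s.replace('\n', ' ') for s in record)

print(dst.getvalue())
```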
Upvotes: 2
Reputation: 247042
I would use a language with a CSV parser. Try to parse the current line; if there's an error, join the next line and try again. For example, with Ruby:
ruby -rcsv -ne '
  chomp
  loop do
    begin
      row = CSV.parse_line($_)
      # if no error thrown, we have a parseable line
      puts row.inspect
      break
    rescue
      # grab the next line and try again
      $_ += gets
    end
  end
' << END
a,b,c,d,e
1,2,3,4,5
"2013-11-19 10:09:09","0","I","NOT SET ","
simple string "
"a 1","b 2","c 3","d 4","e 5"
END
["a", "b", "c", "d", "e"]
["1", "2", "3", "4", "5"]
["2013-11-19 10:09:09", "0", "I", "NOT SET ", "simple string "]
["a 1", "b 2", "c 3", "d 4", "e 5"]
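For comparison, the same try-again idea can be sketched in Python. Python's csv module does not raise on an unterminated quote the way Ruby's CSV.parse_line does, so this hypothetical sketch uses quote parity as the "record is incomplete" test instead of an exception:

```python
import csv
import io

def logical_records(lines):
    """Accumulate physical lines until the buffer contains an even
    number of double quotes, i.e. every quoted field is closed."""
    buf = ''
    for line in lines:
        buf += line.rstrip('\n')
        if buf.count('"') % 2 == 0:
            yield buf
            buf = ''
        else:
            buf += ' '  # the embedded newline becomes a space
    if buf:
        yield buf

sample = [
    'a,b,c,d,e\n',
    '"2013-11-19 10:09:09","0","I","NOT SET ","\n',
    'simple string "\n',
]
for rec in logical_records(sample):
    print(next(csv.reader(io.StringIO(rec))))
```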
Upvotes: 1
Reputation: 10039
sed -n -e '/"/ {
s/:/:d/g;s/\\"/:e/g
:b
/^\(\("[^"]*"\)*[^"]*\)*"\([^"]*\)$/ {
N
s/\
//
b b
}
s/:e/\\"/g;s/:d/:/g
}
p' YourFile
This uses a translation of \" beforehand. It costs a bit more CPU but passes through escaped quotes correctly.
Upvotes: 0
Reputation: 123570
Here's one way of doing it:
sed -n -e ':b; /^[^"]*"[^"]*\("[^"]*"[^"]*\)*$/ { N; s/\
//; bb; }; p; '
In pseudocode it goes:
label b:
    if the line has an odd number of quotes:
        read and append the next line
        remove the line feed
        goto b
print line
Example output:
$ cat file
"2013-11-19 10:09:09","0","I","NOT SET ","
simple string "
"normal data",42
"some other
string"
$ sed -n -e ':b; /^[^"]*"[^"]*\("[^"]*"[^"]*\)*$/ { N; s/\
//; bb; }; p; ' < file
"2013-11-19 10:09:09","0","I","NOT SET ","simple string "
"normal data",42
"some otherstring"
$
Note that quotes escaped with backslash ("foo\"bar") will ruin it, while quotes escaped by doubling ("foo""bar") will work. Make sure you know which CSV dialect you're using.
Upvotes: 4