Reputation: 35
How to tell awk to ignore the delimiter inside double quotation marks
For example:
line='test,t2,t3,"t5,"'
echo "$line" | awk -F "," '{print $4}'
I expect "t5," but I actually get "t5 (the field is cut at the comma inside the quotes).
How do I get "t5,"?
Upvotes: 1
Views: 5008
Reputation: 6723
In the general case, you can't. You need a full parser that remembers an opening quote, changes state, and then goes back to the prior state when it encounters the matching quote. You can't do that with a regular expression unless you make a lot of assumptions about the shape of your data, and since I see you're parsing CSV, those assumptions will not hold true.
If you like awk, I suggest trying perl for this problem. You can either use somebody else's CSV parsing library (search here), or you can write your own. Of course, there's no reason you can't write a CSV parser in pure awk, so long as you understand that this is not what awk is good at. You need to parse character by character (don't separate records by newlines), remember the current state (are we inside quotes?) and remember the previous character to see whether it was a backslash (for treating a quote as a literal quote or a comma as a literal comma). You also need to remember the previous quote so you can parse "" as an escaped quote instead of a malformed field. It's kind of fun, and it's a bitch. Use somebody else's library if you like. I wouldn't choose awk to write any parser where the records don't have an obvious separator.
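To make the state-tracking idea concrete, here is a minimal sketch in POSIX awk, not a full CSV parser. It walks each line character by character and only treats a comma as a separator when it is outside quotes. The function name csv_field is hypothetical, and the sketch assumes one record per line (no embedded newlines) and no backslash escapes. Because the quote characters are kept in the output, a doubled "" inside a quoted field simply toggles the state twice and needs no special case.

```shell
# Hypothetical helper: print field number $1 of every CSV line on stdin.
# Assumptions: one record per line, no backslash escapes, quotes are kept.
csv_field() {
  awk -v want="$1" '
  {
    n = 0; field = ""; inq = 0
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (c == "\"") inq = !inq            # entering or leaving a quoted section
      if (c == "," && !inq) {              # only an unquoted comma ends a field
        f[++n] = field
        field = ""
        continue
      }
      field = field c
    }
    f[++n] = field                         # the last field has no trailing comma
    print (want <= n ? f[want] : "")
  }'
}

line='test,t2,t3,"t5,"'
printf '%s\n' "$line" | csv_field 4
```

With the sample line from the question this prints "t5," including the quotes. Handling embedded newlines or backslash escapes would need more state than this sketch keeps.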
Edit: Ed Morton actually did write a full CSV parser for Gawk, which he linked to in his answer. I helped him break it, and he quickly fixed the problem case. His script will be useful, though it will be somewhat unwieldy to adapt to real-world uses.
Upvotes: 0
Reputation: 1517
Perhaps this is better.
echo 'test,t2,t3,"t5,"' | awk -F, '{print $(NF-1),$NF}' OFS=,
"t5,"
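To see why this works, it may help to list every field awk produces when it splits on bare commas. The helper name show_fields below is made up for illustration; it uses only plain POSIX awk.

```shell
# Hypothetical helper: number and print each field awk sees with -F,
show_fields() {
  awk -F, '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

line='test,t2,t3,"t5,"'
printf '%s\n' "$line" | show_fields
```

The quoted value is split into two fields, "t5 and a lone ", so NF is 5. Printing $(NF-1) and $NF with OFS=, just glues those two pieces back together, which means the command relies on the quoted field being last and containing exactly one comma.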
Upvotes: -1
Reputation: 203229
With GNU awk for FPAT, all you need for your case is:
$ line='test,t2,t3,"t5,"'
$ echo "$line" | awk -v FPAT='([^,]*)|("[^"]*")' '{print $4}'
"t5,"
and if your CSV fields can contain newlines and escaped quotes, then see "What's the most robust way to efficiently parse CSV using awk?".
Upvotes: 4
Reputation: 1419
You could validate arbitrary input first or, if you know where your input is not well formatted, use substr() starting from index 2 of column 4 and print the trailing comma yourself.
$ echo 'test,t2,t3,"t5,"' | awk -F, '{printf "%s,\n", substr($4,2) }'
t5,
Upvotes: -1