seanchann

Reputation: 35

Awk: ignore the delimiter inside double quotation marks

How can I tell awk to ignore the delimiter when it appears inside double quotation marks?

For example:

line='test,t2,t3,"t5,"'
echo "$line" | awk -F "," '{print $4}'

I expect "t5," but awk prints "t5 (the field is cut at the comma inside the quotes).

How can I get "t5,"?

Upvotes: 1

Views: 5008

Answers (4)

piojo

Reputation: 6723

In the general case, you can't. You need a full parser to remember a tag, change state, then go back to the prior state when it encounters the matching tag. You can't do it with a regular expression unless you make a lot of assumptions about the shape of your data--and since I see you're parsing CSV, those assumptions will not hold true.

If you like awk, I suggest trying perl for this problem. You can either use somebody else's CSV parsing library (search here), or you can write your own. Of course, there's no reason you can't write a CSV parser in pure awk, so long as you understand that this is not what awk is good at. You need to parse character by character (don't separate records by newlines), remember the current state (is the line quoted?) and remember the previous character to see whether it was a backslash (for treating a quote as a literal quote or a comma as a literal comma). You need to remember the previous quote so you can parse "" as an escaped quote instead of a malformed field. It's kind of fun, and it's a bitch. Use somebody else's library if you like. I wouldn't choose awk to write any parser where the records don't have an obvious separator.
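
As a rough illustration of the character-by-character approach described above, here is a minimal sketch in portable awk. The function name csv_field is hypothetical, and the sketch makes simplifying assumptions: one record per line, "" as the only quote escape, and no backslash handling. It keeps the surrounding quotes in the output, as the question expects.

```shell
# csv_field N: print field N of each CSV line read from stdin.
# Hypothetical helper; assumes one record per line and "" as the only escape.
csv_field() {
  awk -v want="$1" '
  {
    n = 1; field = ""; inq = 0
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (inq) {
        if (c == "\"" && substr($0, i + 1, 1) == "\"") {
          field = field "\"\""      # doubled quote: keep both, skip the second
          i++
        } else {
          if (c == "\"") inq = 0    # closing quote ends the quoted state
          field = field c
        }
      } else if (c == "\"") {
        inq = 1                     # opening quote starts the quoted state
        field = field c
      } else if (c == ",") {
        if (n == want) break        # wanted field is complete
        n++
        field = ""
      } else {
        field = field c
      }
    }
    if (n == want) print field; else print ""
  }'
}

printf '%s\n' 'test,t2,t3,"t5,"' | csv_field 4   # prints "t5,"
```

The only state carried between characters is inq (inside quotes or not), which is what lets a comma be treated as data rather than as a separator. Handling embedded newlines would mean carrying that state across input lines, which this sketch does not attempt.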

Edit: Ed Morton actually did write a full CSV parser for Gawk, which he linked to in his answer. I helped him break it, and he quickly fixed the problem case. His script will be useful, though it will be somewhat unwieldy to adapt to real-world uses.

Upvotes: 0

Claes Wikner

Reputation: 1517

Perhaps this is better.

echo 'test,t2,t3,"t5,"' | awk -F, '{print $(NF-1),$NF}' OFS=,

"t5,"

Upvotes: -1

Ed Morton

Reputation: 203229

With GNU awk for FPAT, all you need for your case is:

$ line='test,t2,t3,"t5,"'
$ echo "$line" | awk -v FPAT='([^,]*)|("[^"]*")' '{print $4}'
"t5,"

If your input can contain newlines and escaped quotes inside fields, then see What's the most robust way to efficiently parse CSV using awk?.

Upvotes: 4

John Goofy

Reputation: 1419

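You could validate arbitrary input first. If you already know which column is affected, use substr() starting from index 2 in column 4 to skip the leading quote, then append the comma yourself. Note that this drops the surrounding quotes from the result.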

$ echo 'test,t2,t3,"t5,"' | awk -F, '{printf "%s,\n", substr($4,2) }'
t5,

Upvotes: -1
