Reputation: 31
I have been tasked with pulling certain values out of a very ugly csv file.
The csv is in the following format:
command1=value1, command2=value2, etc etc.
No problem so far: I was grep-ing for the command I required and then piping through cut -f 2 -d '=' to return just the value.
The issue I have is that one of the fields is text and can have multiple values, which are also separated by commas. To add another curve ball, if (and only if) one of the values has a space in it, the field will be enclosed in double quotes, so the value I'm looking to pull could be:
command=value,..
command=value1,value2,..
command="value 1",..
command="value 1, value 2",..
(where .. is other values in the log file OR the end of the line)
I thought I had cracked it by simply pulling the data between two field names using grep -oP '(?<=command1=).*(?= command2)' and then piping that through rev | cut -c 2- | rev to strip the trailing comma.
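For illustration, here is that approach on a made-up line in the original field order (the rev | cut -c 2- | rev stage strips the trailing comma that the lookahead leaves behind):

```shell
# Hypothetical sample line; this only works because command2 directly follows command1
line='command1="value 1, value 2", command2=x'
printf '%s\n' "$line" |
  grep -oP '(?<=command1=).*(?= command2)' |
  rev | cut -c 2- | rev
# → "value 1, value 2"
```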
But I've now found out that the order the fields appear in isn't consistent, so the file could be:
command1=value1, command3=value3, command2=value2
How can I get the value of command2 when it may or may not be enclosed in double quotes and may also have commas within it? I'm struggling to see how it is even possible: how will grep know what is a value break and what is the start of the next field?
Any help gratefully accepted.
Upvotes: 3
Views: 448
Reputation: 203702
idk if it's what you're looking for or not, but given this input file:
$ cat file
command1=value1.1,command2=value2.1,value2.2,command3="value 3.1",command4="value 4.1, value 4.2"
this GNU awk (for the 4th arg to split()) script might be what you want:
$ cat tst.awk
{
    delete c2v
    split($0,f,/,?[^=,]+=/,s)
    for (i=1; i in s; i++) {
        gsub(/^,|=$/,"",s[i])
        print "populating command to value:", s[i], "->", f[i+1]
        c2v[s[i]] = f[i+1]
    }
    print c2v["command2"]
    print c2v["command4"]
}
$ awk -f tst.awk file
populating command to value: command1 -> value1.1
populating command to value: command2 -> value2.1,value2.2
populating command to value: command3 -> "value 3.1"
populating command to value: command4 -> "value 4.1, value 4.2"
value2.1,value2.2
"value 4.1, value 4.2"
Modify the print statements to suit; it should be obvious...
Upvotes: 0
Reputation: 189507
In the worst case (say, if ", command2=" could occur in the quoted value of another key) the only recourse is probably to write a dedicated parser for this pesky format. (Killing the person who came up with it will unfortunately not solve any problems, and may result in new ones. I understand it could be tempting, but don't.)
For a quick and dirty hack, perhaps this is sufficient, though:
grep -oP '(^|, )command2=\K([^,"]+|"[^"]+")'
This will keep the double quotes if the field value is quoted, but that should be easy to fix if it's undesired. Moving to a better tool than grep could bring better precision as well, though; here's a sed variant with additional anchoring:
sed -n 's/^\(.*, \)*command2=\(\([^,"]*\)\|"\([^"]*\)"\)\(, .*\)*$/\3\4/p'
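To see the grep version in action, here are a couple of made-up lines covering both the unquoted and the quoted case:

```shell
# Hypothetical sample lines; note command2 appears in a different position on each
printf '%s\n' \
  'command1=a, command2=plain, command3=c' \
  'command3=c, command2="value 1, value 2", command1=a' |
  grep -oP '(^|, )command2=\K([^,"]+|"[^"]+")'
# → plain
# → "value 1, value 2"
```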
Upvotes: 1
Reputation: 7166
I would combine grep and sed. Suppose you have this input in example.csv:
command1=value1, command2=value2,
command1=value1, command2="value2, value3"
command1=value1, command3=value3
Then this command:
grep 'command2=' example.csv |
sed -e 's/.*command2=//g' -e 's/^\([^"][^,]*\),.*$/\1/g' -e 's/^"\([^"]*\)".*$/\1/g'
will give you this:
value2
value2, value3
Explanation:
- grep finds the right lines
- sed removes everything around the desired value: the first -e strips everything up to and including command2=, and the remaining two -e expressions strip everything after the value, for the unquoted and quoted cases respectively

Please note that CSV is an extremely complicated format. This regex makes some assumptions, e.g. that command2 appears only as a key. If this csv is not good enough then I would use a real programming language that has a mature csv library.
Upvotes: 1