Reputation: 83
I need simple parsing with embedded single and double quotes. For the following input:
" hello 'there ok \"hohh\" ' ciao \"eeee \" \" yessss 'aaa' \" %%55+ "
I need the following output:
["hello", "there ok \"hohh\" ", "ciao", "eeee ", " yessss 'aaa' ", "%%55+"]
Why does the following Ruby code that I came up with work? I do not understand the regex part. I know basic regex, and I assumed the embedded quotes would not work, but they do, both for single quotes containing double quotes and vice versa.
text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}
Upvotes: 0
Views: 67
Reputation: 28305
No need to solve this with a custom regex; the Ruby standard library contains a module for this: Shellwords.
Manipulates strings like the UNIX Bourne shell
This module manipulates strings according to the word parsing rules of the UNIX Bourne shell.
Usage:
require 'shellwords'
str = " hello 'there ok \"hohh\" ' ciao \"eeee \" \" yessss 'aaa' \" %%55+ "
Shellwords.split(str)
#=> ["hello", "there ok \"hohh\" ", "ciao", "eeee ", " yessss 'aaa' ", "%%55+"]
# Or equivalently:
str.shellsplit
#=> ["hello", "there ok \"hohh\" ", "ciao", "eeee ", " yessss 'aaa' ", "%%55+"]
The above is the "right" answer. Use that. What follows is additional information to explain why to use this, and why your answer "sort-of" works.
Parsing these strings accurately is tricky! Your regex attempt works for most inputs, but does not properly handle various edge cases. For example, consider:
str = "foo\\ bar"
str.shellsplit
#=> ["foo bar"] (correct!)
str.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}
#=> ["foo\\", "bar"] (wrong!)
The method's implementation does still use a (more complex!) regex under the hood, but also handles edge cases such as invalid inputs - which yours does not.
line.scan(/\G\s*(?>([^\s\\\'\"]+)|'([^\']*)'|"((?:[^\"\\]|\\.)*)"|(\\.?)|(\S))(\s|\z)?/m)
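To illustrate the invalid-input point, here is a minimal sketch. The helper names regex_tokens and shellwords_tokens and the sample string are made up for this example; only Shellwords.split and the regex from the question come from the discussion above. It assumes an input whose single quote is never closed:

```ruby
require 'shellwords'

# Hypothetical helper: the scan-based approach from the question.
def regex_tokens(text)
  text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.compact
end

# Hypothetical helper: returns the tokens, or the error message
# if Shellwords rejects the input.
def shellwords_tokens(text)
  Shellwords.split(text)
rescue ArgumentError => e
  "ArgumentError: #{e.message}"
end

bad = "hello 'there"   # assumed input: the single quote is never closed

p regex_tokens(bad)       # the stray quote silently ends up inside a token
p shellwords_tokens(bad)  # Shellwords raises ArgumentError instead
```

Whether raising or guessing is preferable depends on your use case, but the scan-based version gives you no signal that the input was malformed.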
So without digging too deeply into the flaws of your approach (but suffice to say, it doesn't always work!), why does it mostly work? Well, your regex:
/\"(.*?)\"|'(.*?)'|([^\s]+)/
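...is saying:
If a " is found, match as little as possible (.*?) up until the closing ".
Same as above, for single quotes (').
Otherwise, match a run of non-whitespace characters ([^\s]+ -- which could also, equivalently, have been written as \S+).
Each quoted alternative only looks for its own closing delimiter, so the other kind of quote inside it is just an ordinary character matched by .*?. That is why the embedded quotes work.
The .flatten is necessary because you're using capture groups ((...)), which makes scan return one array per match. This could have been avoided if you'd used non-capture groups instead ((?:...)), although the matches would then include the surrounding quotes.
The .select{|x|x}, or (effectively) equivalently .compact, was also necessary because of these capture groups, since in each match 2 of the 3 groups were not part of the result and come back as nil.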
Upvotes: 1