Why does the following parsing solution work?

Question

I need simple parsing with embedded single and double quotes. For the following input:

" hello    'there   ok \"hohh\"   '   ciao    \"eeee  \"   \"  yessss 'aaa'  \"   %%55+ "

I need the following output:

["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]

Why does the following Ruby code that I came up with work? I do not understand the regex part. I know basic regex, and I would expect the embedded quotes not to work, yet they do, both with double quotes inside single quotes and vice versa.

text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}

Tom Lord · Accepted Answer

No need to solve this with a custom regex; the Ruby standard library contains a module for this: Shellwords.

Manipulates strings like the UNIX Bourne shell

This module manipulates strings according to the word parsing rules of the UNIX Bourne shell.

Usage:

require 'shellwords'

str = " hello    'there   ok \"hohh\"   '   ciao    \"eeee  \"   \"  yessss 'aaa'  \"   %%55+ "

Shellwords.split(str)
  #=> ["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]
# Or equivalently:
str.shellsplit
  #=> ["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]

The above is the "right" answer. Use that. What follows is additional information to explain why to use this, and why your answer "sort-of" works.

Parsing these strings accurately is tricky! Your regex attempt works for most inputs, but does not properly handle various edge cases. For example, consider:

str = "foo\\ bar" # the string contains a literal backslash followed by a space

str.shellsplit
  #=> ["foo bar"] (correct!)

str.scan(/"(.*?)"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}
  #=> ["foo\\", "bar"] (wrong!)

The method's implementation does still use a (more complex!) regex under the hood, but also handles edge cases such as invalid inputs - which yours does not.

line.scan(/\G\s*(?>([^\s\\'"]+)|'([^\']*)'|"((?:[^"\\]|\\.)*)"|(\\.?)|(\S))(\s|\z)?/m)
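To make the point about invalid inputs concrete, here is a minimal sketch comparing the two approaches on a string with an unclosed quote. The helper names regex_scan_split and shellsplit_or_error are hypothetical, introduced only for this illustration.

```ruby
require 'shellwords'

# Hypothetical helper: the one-liner from the question, using .compact.
def regex_scan_split(text)
  text.scan(/"(.*?)"|'(.*?)'|([^\s]+)/).flatten.compact
end

# Hypothetical helper: returns the parsed words, or a short marker string
# when Shellwords rejects the input.
def shellsplit_or_error(text)
  Shellwords.split(text)
rescue ArgumentError => e
  "rejected: #{e.class}"
end

unbalanced = "foo 'bar" # the single quote is never closed

p regex_scan_split(unbalanced)
  #=> ["foo", "'bar"]
p shellsplit_or_error(unbalanced)
  #=> "rejected: ArgumentError"
```

The regex silently keeps the stray quote as part of a word, whereas Shellwords raises an ArgumentError, so the caller can tell that the input was malformed.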

So without digging too deeply into the flaws of your approach (but suffice to say, it doesn't always work!), why does it mostly work? Well, your regex:

/"(.*?)"|'(.*?)'|([^\s]+)/

...is saying:

If " is found, match as little as possible (.*?) up until the next ". Any ' in between is matched by . like any other character, which is why single quotes embedded in double quotes work.
Same as above, with the roles swapped, for single quotes (').
Otherwise, match a run of consecutive non-whitespace characters ([^\s]+ -- which could also, equivalently, have been written as \S+).
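The three alternatives above can be tried on small inputs. This is only a sketch; question_split is a hypothetical wrapper around the one-liner from the question.

```ruby
# Hypothetical wrapper around the question's one-liner.
def question_split(text)
  text.scan(/"(.*?)"|'(.*?)'|([^\s]+)/).flatten.compact
end

# A single quote inside a double-quoted token is just another character for `.`:
p question_split(%q{"it's fine" next})
  #=> ["it's fine", "next"]

# Double quotes inside a single-quoted token behave the same way:
p question_split(%q{'say "hi"' next})
  #=> ["say \"hi\"", "next"]

# The same quote type cannot be embedded, even when backslash-escaped:
# the lazy .*? stops at the first " it meets, so the token falls apart
# into three fragments instead of one.
p question_split(%q{"a \"b\" c"})
```

The last call is the kind of input where the regex and Shellwords disagree: Shellwords treats \" inside double quotes as an escaped quote and returns a single word.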

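The .flatten is necessary because you're using capture groups ((...)): with groups, scan returns one array per match. With non-capture groups ((?:...)) scan would instead return a flat array of whole matches, although those would still include the surrounding quotes.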

The .select{|x|x} (or, effectively equivalently, .compact) was also necessary because of these capture groups: in each match, 2 of the 3 groups did not take part in the match and are therefore nil.
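To see what .flatten and .compact are cleaning up, it helps to print the raw result of scan. The sketch below uses two hypothetical helpers, raw_matches and whole_matches; the second drops the capture groups entirely for comparison.

```ruby
# The question's regex, with its three capture groups.
def raw_matches(text)
  text.scan(/"(.*?)"|'(.*?)'|([^\s]+)/)
end

# Hypothetical variant with no capture groups at all, for comparison.
def whole_matches(text)
  text.scan(/".*?"|'.*?'|[^\s]+/)
end

sample = %q{one "two words" 'three'}

p raw_matches(sample)
  #=> [[nil, nil, "one"], ["two words", nil, nil], [nil, "three", nil]]
p raw_matches(sample).flatten.compact
  #=> ["one", "two words", "three"]
p whole_matches(sample)
  #=> ["one", "\"two words\"", "'three'"]
```

Each inner array has one slot per capture group, and only the group belonging to the alternative that matched is filled. Without groups the result is already flat, but the quotes remain in the output and would need to be stripped separately.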
