log69

Reputation: 83

Why does the following parsing solution work?

I need simple parsing with embedded single and double quotes. For the following input:

" hello    'there   ok \"hohh\"   '   ciao    \"eeee  \"   \"  yessss 'aaa'  \"   %%55+ "

I need the following output:

["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]

Why does the following Ruby code that I came up with work? I do not understand the regex part. I know basic regex, but I assumed that embedded quotes would not work. Yet they do, both for single quotes containing double quotes and vice versa.

text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}

Upvotes: 0

Views: 67

Answers (1)

Tom Lord

Reputation: 28305

No need to solve this with a custom regex; the Ruby standard library contains a module for this: Shellwords.

Manipulates strings like the UNIX Bourne shell

This module manipulates strings according to the word parsing rules of the UNIX Bourne shell.

Usage:

require 'shellwords'

str = " hello    'there   ok \"hohh\"   '   ciao    \"eeee  \"   \"  yessss 'aaa'  \"   %%55+ "

Shellwords.split(str)
  #=> ["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]
# Or equivalently:
str.shellsplit
  #=> ["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]

The above is the "right" answer. Use that. What follows is additional information to explain why to use this, and why your answer "sort-of" works.

Parsing these strings accurately is tricky! Your regex attempt works for most inputs, but does not properly handle various edge cases. For example, consider:

str = "foo\\ bar"

str.shellsplit
  #=> ["foo bar"] (correct!)

str.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}
  #=> ["foo\\", "bar"] (wrong!)

The method's implementation does still use a (more complex!) regex under the hood, but also handles edge cases such as invalid inputs - which yours does not.

line.scan(/\G\s*(?>([^\s\\\'\"]+)|'([^\']*)'|"((?:[^\"\\]|\\.)*)"|(\\.?)|(\S))(\s|\z)?/m)
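As a rough illustration of the "invalid inputs" point, here is a small sketch comparing the two approaches on a string with an unclosed quote. The helper names regex_split and shellsplit_status are hypothetical, made up for this example; only Shellwords.split comes from the standard library.

```ruby
require 'shellwords'

# Hypothetical helper: the regex approach from the question, unchanged
# apart from using .compact to drop the nil groups.
def regex_split(text)
  text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.compact
end

# Hypothetical helper: returns the parsed words, or the name of the
# exception class if Shellwords rejects the input.
def shellsplit_status(text)
  Shellwords.split(text)
rescue ArgumentError => e
  e.class.name
end

unbalanced = "foo 'bar"

p regex_split(unbalanced)       # => ["foo", "'bar"]  (stray quote kept silently)
p shellsplit_status(unbalanced) # => "ArgumentError"  (input rejected)
```

The regex version quietly returns a token with a dangling quote, whereas Shellwords signals that the input is malformed.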

Without digging too deeply into the flaws of your approach (suffice to say, it doesn't always work!), why does it mostly work? Well, your regex:

/\"(.*?)\"|'(.*?)'|([^\s]+)/

...is saying:

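  • If " is found, match as little as possible (.*?) up until the next ". Since . matches any character, a ' inside the double quotes is just ordinary text, which is why embedded single quotes work.
  • Same as above, for single quotes ('): any " inside them is matched by . as ordinary text.
  • If neither quoted alternative matches at the current position, match a run of one or more non-whitespace characters ([^\s]+ -- which could also, equivalently, have been written as \S+).

The alternatives are tried left to right at each position, so whichever quote character comes first decides which branch consumes the text up to its matching partner.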
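To see the three alternatives at work, the following sketch prints what scan returns before any post-processing. The helper name raw_groups and the sample string are made up for illustration; the pattern is the one from the question.

```ruby
# Hypothetical helper: returns the raw capture groups for every match,
# in the order [double-quoted, single-quoted, bare word].
def raw_groups(text)
  text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/)
end

sample = %q(say 'a "b" c' "d 'e' f" g)

raw_groups(sample).each { |groups| p groups }
# [nil, nil, "say"]
# [nil, "a \"b\" c", nil]
# ["d 'e' f", nil, nil]
# [nil, nil, "g"]
```

Only one group is filled per match, and the quote that opens a match is the only one that can close it, so the other kind of quote passes through untouched.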

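The .flatten is necessary because you're using capture groups ((...)), which makes scan return one array of groups per match. With no capture groups at all (or only non-capture groups, (?:...)), scan returns a flat array of strings, but each string is then the whole match, so the surrounding quotes would be included.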

The .select{|x|x} (or, effectively equivalently, .compact) was also necessary because of these capture groups: in each match, 2 of the 3 groups did not take part and are therefore nil.
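A short sketch of the post-processing, assuming the same pattern as in the question. The helper names tokens and tokens_with_quotes are hypothetical; the second one drops the capture groups to show what scan returns in that case.

```ruby
# Hypothetical helper: capture groups, then flatten and drop the nils.
def tokens(text)
  text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.compact
end

# Hypothetical helper: same alternatives without capture groups.
# scan now returns whole matches, so the quotes stay in the result.
def tokens_with_quotes(text)
  text.scan(/\".*?\"|'.*?'|\S+/)
end

sample = %q(a 'b c' "d")

p tokens(sample)             # => ["a", "b c", "d"]
p tokens_with_quotes(sample) # => ["a", "'b c'", "\"d\""]
```

Dropping the groups removes the need for .flatten and .compact, at the cost of having to strip the quotes in a separate step.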

Upvotes: 1
