Ambidex

Reputation: 857

Regex string until unescaped comma

I have the following string:

{lorum=Vestibulum id ligula porta felis euismod semper. Sed posuere\, consectetur est at lobortis.,ipsum= Cras mattis consectetur purus sit amet fermentum. Nulla vitae elit libero, a pharetra augue.}

Now, what I want to get is:

array (
    array( 
        'operator' => 'lorum',
        'value' => 'Vestibulum id ligula porta felis euismod semper. Sed posuere\, consectetur est at lobortis.'
    ),
    array(
        'operator' => 'ipsum',
        'value' => 'Cras mattis consectetur purus sit amet fermentum. Nulla vitae elit libero, a pharetra augue.'
    )
)

The biggest problem is that I can't get my regex to do a lookbehind on a .*. I was trying something like this (without naming the groups yet, by the way):

[{,]?([a-zA-Z_]*)=((?<!\\).*)[(?<!\\),}]

I'm using gskinner's RegExr to try out my regexes. I have also tried a lot of other variations, but none has been successful so far.

Eventually this regex will be used in a PHP script. Of course, I would not mind rebuilding the above regex completely, though I would like to keep the solution at the regex level, if not for the sake of speed, then just for educational purposes.

Upvotes: 0

Views: 326

Answers (3)

Martin Ender

Reputation: 44259

As stema said in a comment, lookbehinds have to be of fixed length (or at least of bounded length) in most regex engines, including the PCRE engine PHP uses; .NET is a notable exception. Also, [(?<!\\),}] does not contain a lookbehind at all: inside square brackets the characters (, ?, <, ! and ) are taken literally, so the class simply matches any one of the characters listed. You could reverse your attempt and consume everything except unescaped commas and unescaped closing braces:

([a-zA-Z_]*)=((?:[^\\,}]|\\.)*)

In free-spacing mode with some explanation:

([a-zA-Z_]*)=    # match and capture the key (as in your own regex)
(                # capture the value
  (?:            # non-capturing group for allowed sequences for the value
    [^\\,}]      # any character except backslash, comma and closing brace
  |              # OR
    \\.          # a backslash followed by anything
  )
  *              # repeat as long as possible
)                # end of capturing group

Note that this allows escaping of any character (including other backslashes and closing braces).
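To illustrate the point about escaping, here is a minimal PHP sketch. The input string is made up for this example, and the unescaping step at the end is an optional addition that the question does not ask for. The backslashes are doubled because the pattern and the input sit inside PHP string literals.

```php
<?php
// Hypothetical input; after PHP parses the literal, the string is {key=a\}b\\,other=c}
$input = '{key=a\\}b\\\\,other=c}';

// Same pattern as above, written as a single-quoted PHP string.
$pattern = '/([a-zA-Z_]*)=((?:[^\\\\,}]|\\\\.)*)/';

preg_match_all($pattern, $input, $matches);
$values = $matches[2]; // all captures of group 2 (the values)

// The escaped brace and the escaped backslash stay inside the first value,
// and the unescaped comma ends it.
// Optional extra step: drop the escaping backslash in front of each escaped character.
$unescaped = preg_replace('/\\\\(.)/s', '$1', $values[0]);

var_dump($values, $unescaped);
```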

PHP's preg_match_all returns the matches in a slightly different structure than the one you want, but it is easy to rearrange. Also, in a PHP string literal you have to escape each backslash a second time, so every literal backslash the regex should match is written as four. Like:

$pattern = '/([a-zA-Z_]*)=((?:[^\\\\,}]|\\\\.)*)/';
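As a rough sketch of the rearranging step, the following uses preg_match_all with the PREG_SET_ORDER flag and builds the structure from the question. The input is a shortened stand-in for the original string, $result is just an illustrative variable name, and the trim() call is an assumption added here to drop the space after "ipsum=".

```php
<?php
// Shortened sample input; the \, is an escaped comma that must stay in the value.
$input = '{lorum=Sed posuere\\, consectetur est.,ipsum= Cras mattis purus.}';

$pattern = '/([a-zA-Z_]*)=((?:[^\\\\,}]|\\\\.)*)/';

// PREG_SET_ORDER groups the captures per match instead of per capture group.
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

$result = [];
foreach ($matches as $m) {
    $result[] = [
        'operator' => $m[1],        // the key before the equals sign
        'value'    => trim($m[2]),  // trim() is an addition, not part of the regex
    ];
}

print_r($result);
```

Without PREG_SET_ORDER, the default ordering puts all keys in $matches[1] and all values in $matches[2], which would need an extra step to pair up.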


Also note that greedy patterns that cannot go past the end of what you want to match are in most cases more efficient than non-greedy solutions that try to find the first thing that is disallowed.

Upvotes: 4

Something like this: http://rubular.com/r/XLI9euNcL5

[{,]?([a-zA-Z_]*?)=(.*?)(?:[^\\][,]|})

Upvotes: 0

Richard Brown

Reputation: 11436

The .* is greedy, which prevents the match. Try:

[{,]?([a-zA-Z_]*?)=((?<!\\).*?)[(?<!\\),}]

Rubular: http://rubular.com/r/l8R3GCmalw

Upvotes: 0
