Bill in Kansas City

Reputation: 360

regex: parsing multiple quoted name/value pairs

Working in PHP 5.6/7.0.

I've tried several regexes from several questions and spent a couple of hours on several regex websites, but I can't find anything that gets me what I need. I have a string like this:

At vero eos et accusamus et iusto odio dignissimos ducimus

<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_1.txt" -->

qui blanditiis praesentium voluptatum deleniti atque corrupti

<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_2.txt" -->

quos dolores et quas excepturi sint

I'm looking for the following matches from the tokens:

<!-- @@include ...the whole thing... -->
default
/admin/creditapp/templates/longform/
try
/wTemplates/forms/templates/
file
credit_row_1.txt

Repeated, naturally, for every time the whole group is found. I can loop the file and accomplish that, so just one instance at a time is fine. The only expression I could come up with that gets me that is:

<!-- @@include (?:(try|default|file)=\"(.+?)\"?)(?:\s*)(?:(try|default|file)=\"(.+?)\"?)(?:\s*)(?:(try|default|file)=\"(.+?)\"?)(?:\s*)-->

That is HUGE, and it doesn't allow for other possibilities, like, I don't know, "(try|foo|bar|default)" or something, or for the omission of either "try" or "default", e.g. "(foo|bar|file)".

In the template

<!-- @@include    -->

is constant. Inside that can be 2 to n name=value pairs. I tried:

(<!-- @@include (?:(try|default|file)=\"(.+?)\" ?){1,3}-->)

but it only returns the last name=value found. I'd like to think I'm close, but I can't work it out.

Upvotes: 0

Views: 41

Answers (1)

Casimir et Hippolyte

Reputation: 89557

PCRE is unable to store the different contents of a repeated capture group. When the group is repeated, the previous content is overwritten with the current and so on.
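
As a small illustration of that overwriting, here is a sketch that runs a pattern close to the second attempt in the question against one of its sample tags. Only the sample tag and the pattern shape come from the question; the variable names are mine.

```php
<?php
// One sample tag from the question.
$tag = '<!-- @@include default="/admin/creditapp/templates/longform/" '
     . 'try="/wTemplates/forms/templates/" file="credit_row_1.txt" -->';

// The name/value group is repeated up to three times inside a single match.
preg_match('~<!-- @@include (?:(try|default|file)="(.+?)" ?){1,3}-->~', $tag, $m);

// Groups 1 and 2 only hold what the last iteration captured.
echo $m[1], ' => ', $m[2], "\n"; // file => credit_row_1.txt
```

The "default" and "try" pairs were matched, but their captures were replaced on each new iteration of the group.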
One workaround is to use preg_match_all with the \G anchor, which matches at the position where the previous match ended (by default it also matches at the start of the string).

preg_match_all('~(?:\G(?!\A)|<!-- @@include)\s+(try|default|file)="(.*?)"~', $str, $matches);

The idea of this kind of pattern is to succeed with the second branch, <!-- @@include, for the first match, and then with the first branch, \G(?!\A), for all the following consecutive matches. When the part \s+(try|default|file)="(.*?)" fails, contiguity is broken and the regex engine has to find the next occurrence of <!-- @@include to continue.
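
To make the behaviour concrete, the sketch below runs that pattern over a shortened version of the string from the question (the filler text is trimmed, and the variable names are assumptions of mine). It uses PREG_SET_ORDER so that each match is one name/value pair.

```php
<?php
// Shortened version of the sample text from the question.
$str = <<<'TXT'
At vero eos
<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_1.txt" -->
qui blanditiis
<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_2.txt" -->
quos dolores
TXT;

$count = preg_match_all(
    '~(?:\G(?!\A)|<!-- @@include)\s+(try|default|file)="(.*?)"~',
    $str,
    $matches,
    PREG_SET_ORDER
);

// Each match is one pair: group 1 is the name, group 2 is the value.
foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
```

This prints six lines, three per tag, in one flat list. Nothing in the list says where the first tag ends and the second begins.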

If you want to know when the second branch succeeds, you only have to put a capture group in the second branch:

$result = [];

if ( preg_match_all('~(?:\G(?!\A)|<!-- (@)@include)\s+(try|default|file)="(.*?)"~', $str, $matches, PREG_SET_ORDER) ) {
    foreach ($matches as $m) {
        if ( !empty($m[1]) ) { // group 1 is only set when the second branch (a new tag) succeeds
            if ( isset($temp) )
                $result[] = $temp; // store the pairs of the previous tag
            $temp = [];
        }
        $temp[$m[2]] = $m[3]; // name => value
    }
}

// store the pairs of the last tag
if ( isset($temp) )
    $result[] = $temp;
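
For a quick check of the grouping, the same loop can be wrapped in a function and run against two tags from the question. This is only a sketch: the function name collect_includes and the $current variable are hypothetical, not part of the code above.

```php
<?php
// Hypothetical wrapper around the loop above; the name is illustrative only.
function collect_includes($str)
{
    $result  = [];
    $current = null;
    $pattern = '~(?:\G(?!\A)|<!-- (@)@include)\s+(try|default|file)="(.*?)"~';

    if (preg_match_all($pattern, $str, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            if (!empty($m[1])) {            // a new tag starts here
                if ($current !== null) {
                    $result[] = $current;   // store the previous tag
                }
                $current = [];
            }
            $current[$m[2]] = $m[3];        // name => value
        }
    }
    if ($current !== null) {
        $result[] = $current;               // store the last tag
    }
    return $result;
}

$str = 'At vero eos '
     . '<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_1.txt" --> '
     . 'qui blanditiis '
     . '<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_2.txt" -->';

$result = collect_includes($str);
print_r($result);
```

$result then holds two arrays, one per tag, each with the keys default, try and file.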

For something more flexible and able to deal with unknown keys, you can use two preg_match_all:

$result = [];

if ( preg_match_all('~<!-- @@include\s+\K\w+=".*?"(?:\s+\w+=".*?")*~', $str, $matches) ) {
    foreach ($matches[0] as $params) {
        if ( preg_match_all('~(\w+)="(.*?)"~', $params, $keyvals) )
            $result[] = array_combine($keyvals[1], $keyvals[2]);
    }
}

print_r($result);
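
Here is a sketch of the two-pass version run on input that uses an unknown key and leaves out some of the known ones, which is the case raised in the question. The input string and the foo="bar" pair are made up for the example.

```php
<?php
// Made-up input: the first tag omits "try", the second uses an unknown key "foo".
$str = 'intro <!-- @@include default="/admin/creditapp/templates/longform/" file="credit_row_1.txt" --> '
     . 'middle <!-- @@include foo="bar" try="/wTemplates/forms/templates/" file="credit_row_2.txt" --> end';

$result = [];

// First pass: grab the whole attribute list of each tag (\K drops the tag prefix).
if (preg_match_all('~<!-- @@include\s+\K\w+=".*?"(?:\s+\w+=".*?")*~', $str, $matches)) {
    foreach ($matches[0] as $params) {
        // Second pass: split the attribute list into names and values.
        if (preg_match_all('~(\w+)="(.*?)"~', $params, $keyvals)) {
            $result[] = array_combine($keyvals[1], $keyvals[2]);
        }
    }
}

print_r($result);
```

The first entry has only default and file; the second has foo, try and file. Note that array_combine keeps only the last value if a key is repeated within one tag.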

Note that this last solution can be more efficient, with large strings in particular, because the first pattern doesn't start with an alternation but with a literal string (in this case the PCRE regex engine is able to optimize the search). The second pattern only has to deal with short strings, so it isn't a problem.

Upvotes: 1
