Reputation: 360
Working in PHP 5.6/7.0.
I've tried several regexes from several questions, spent a couple of hours on several regex websites, and still can't find anything that gets me what I need. I have a string like this:
At vero eos et accusamus et iusto odio dignissimos ducimus
<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_1.txt" -->
qui blanditiis praesentium voluptatum deleniti atque corrupti
<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_2.txt" -->
quos dolores et quas excepturi sint
I'm looking for the following matches from the tokens:
<!-- @@include ...the whole thing... -->
default
/admin/creditapp/templates/longform
try
/wTemplates/forms/templates
file
credit_row_1.txt
Repeated, naturally, for every time the whole group is found. I can loop the file and accomplish that, so just one instance at a time is fine. The only expression I could come up with that gets me that is:
<!-- @@include (?:(try|default|file)=\"(.+?)\"?)(?:\s*)(?:(try|default|file)=\"(.+?)\"?)(?:\s*)(?:(try|default|file)=\"(.+?)\"?)(?:\s*)-->
Which is HUGE, and doesn't allow for other possibilities, like, I don't know, "(try|foo|bar|default)" or something, or for the omission of either "try" or "default," e.g. "(foo|bar|file)."
In the template, <!-- @@include --> is constant. Inside that can be 2 to n name=value pairs. I tried:
(<!-- @@include (?:(try|default|file)=\"(.+?)\" ?){1,3}-->)
but it only returns the last name=value found. I'd like to think I'm close, but I can't work it out.
Upvotes: 0
Views: 41
Reputation: 89557
PCRE is unable to store the different contents of a repeated capture group. Each time the group repeats, its previous content is overwritten by the current one, so only the last iteration is kept.
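To make the overwriting visible, here is a minimal sketch. The sample string and its short paths are made up for illustration, and the pattern is a simplified form of the repeated-group attempt from the question:

```php
<?php
// Hypothetical sample: one include tag with three name="value" pairs.
$str = '<!-- @@include default="/a/" try="/b/" file="c.txt" -->';

// The non-capturing group repeats up to three times, but capture groups
// 1 and 2 only keep what they matched on the last iteration.
preg_match('~<!-- @@include (?:(\w+)="(.+?)" ?){1,3}-->~', $str, $m);

echo $m[1], ' => ', $m[2], PHP_EOL; // file => c.txt
```

The earlier pairs (default and try) are consumed by the match but are no longer available in $m.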
One workaround is to use preg_match_all together with the \G anchor, which matches at the position where the previous match ended (by default it also matches at the start of the string).
preg_match_all('~(?:\G(?!\A)|<!-- @@include)\s+(try|default|file)="(.*?)"~', $str, $matches);
The idea of this kind of pattern is to succeed with the second branch, <!-- @@include, for the first match, and then with the first branch, \G(?!\A), for all following contiguous matches. When the part \s+(try|default|file)="(.*?)" fails, the contiguity is broken and the regex engine has to find the next occurrence of <!-- @@include to continue.
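As a rough illustration of which branch fires when, the sketch below runs that pattern over a shortened copy of the question's sample text. The "new tag" / "continued" labels and the strpos check are additions for this demo only; they rely on the fact that a match from the second branch starts with the literal <!--.

```php
<?php
// Sample text from the question, with the filler lines shortened.
$str = <<<'TXT'
At vero eos et accusamus
<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_1.txt" -->
qui blanditiis praesentium
<!-- @@include default="/admin/creditapp/templates/longform/" try="/wTemplates/forms/templates/" file="credit_row_2.txt" -->
TXT;

$pattern = '~(?:\G(?!\A)|<!-- @@include)\s+(try|default|file)="(.*?)"~';
$count = preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // A full match that begins with "<!--" came from the second branch.
    $branch = strpos($m[0], '<!--') === 0 ? 'new tag  ' : 'continued';
    echo $branch, ' ', $m[1], ' = ', $m[2], PHP_EOL;
}
```

With this input there should be six matches: each tag produces one "new tag" match for its first pair, followed by two "continued" matches.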
If you want to know when the second branch succeeds, you only have to put a capture group in the second branch:
$result = [];
$pattern = '~(?:\G(?!\A)|<!-- (@)@include)\s+(try|default|file)="(.*?)"~';
if (preg_match_all($pattern, $str, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        if (!empty($m[1])) { // group 1 is set only when the second branch matched (a new tag)
            if (isset($temp)) {
                $result[] = $temp; // store the pairs of the previous tag
            }
            $temp = [];
        }
        $temp[$m[2]] = $m[3]; // name => value
    }
}
if (isset($temp)) {
    $result[] = $temp; // store the pairs of the last tag
}
For something more flexible and able to deal with unknown keys, you can use two preg_match_all calls:
$result = [];
if (preg_match_all('~<!-- @@include\s+\K\w+=".*?"(?:\s+\w+=".*?")*~', $str, $matches)) {
    foreach ($matches[0] as $params) {
        if (preg_match_all('~(\w+)="(.*?)"~', $params, $keyvals)) {
            $result[] = array_combine($keyvals[1], $keyvals[2]);
        }
    }
}
print_r($result);
Note that this last solution can be more efficient, particularly with large strings, because the first pattern starts with a literal string rather than an alternation (in that case the PCRE engine is able to optimize the search). The second pattern only has to deal with short strings, so its cost isn't a problem.
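The question also asked for the whole tag text. One possible variation on the two-pass approach, sketched here under a few assumptions, keeps the full tag alongside its pairs. The function name parse_includes, the 'tag' and 'params' keys, and the sample string are all hypothetical, and the first pattern drops \K in favour of a capture group and requires the closing -->.

```php
<?php
// Hypothetical helper (not from the answer above): for each include tag,
// return the full tag text plus its name => value pairs.
function parse_includes($str)
{
    $result = [];
    // Group 1 holds the run of name="value" pairs inside the tag.
    $tagPattern = '~<!-- @@include\s+(\w+=".*?"(?:\s+\w+=".*?")*)\s*-->~';
    if (preg_match_all($tagPattern, $str, $tags, PREG_SET_ORDER)) {
        foreach ($tags as $tag) {
            // Second pass: split the short parameter string into pairs.
            preg_match_all('~(\w+)="(.*?)"~', $tag[1], $kv);
            $result[] = [
                'tag'    => $tag[0],
                'params' => array_combine($kv[1], $kv[2]),
            ];
        }
    }
    return $result;
}

// Made-up input with an unknown key (foo) and a tag that has a single pair.
$str = 'a <!-- @@include foo="/x/" file="one.txt" --> b <!-- @@include file="two.txt" -->';
$parsed = parse_includes($str);
print_r($parsed);
```

Keeping the full tag text makes it straightforward to replace each tag in the template afterwards, for example with str_replace. If a key is repeated within one tag, array_combine keeps only the last value for it.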
Upvotes: 1