Reputation: 267287
Say I have a block of text like this:
<item>
foo bar foo bar
<item> child item </item>
</item>
<item>
second item
<item> second child </item>
</item>
Here, I want to parse only the two top-level <item> elements and get the result back in an array like this:
[0] = "foo bar foo bar <item>child item</item>"
[1] = "second item <item>second child </item>";
However, in my testing, the child-level <item> tags also match the pattern, so they are included as well and I get a 4-element array rather than the 2-element array I want.
This is the pattern I've used:
%<item>(.+)</item>%si
Any ideas?
Edit: This is NOT for HTML, it's for a custom, in-house scripting language for which I can't use any DOM parsers. So please suggest a regex solution.
Upvotes: 3
Views: 1670
Reputation: 627468
You say the input is not HTML, but you provided a string that looks like HTML. Regex works best on plain text, not on marked-up text. You do not reveal what real language lies behind this type of input, so the solution I can suggest is based on the assumption that the < char cannot appear as a literal in between element nodes (only as some entity).
That means you might use a negated character class, [^<], apply the * quantifier to it, and stop at the next opening <item> tag. Note that this captures only the text that precedes the first child of each top-level item, not the child itself:
%<item>([^<]*)<item>%i
See the regex demo, PHP demo:
$text = "<item> foo bar foo bar <item> child item </item> </item> <item> second item <item> second child </item> </item>";
preg_match_all('%<item>([^<]*)<item>%i', $text, $matches);
print_r($matches[1]);
// => Array ( [0] => foo bar foo bar [1] => second item )
Upvotes: 0
Reputation: 2811
%<p>(.+?)^</p>%smi
edit
$text = "<item> foo bar foo bar <item> child item </item> </item> <item> second item <item> second child </item> </item>";
preg_match_all('%<item>(.*?<item>.*?</item>).*?</item>%si', $text, $matches);
print_r($matches[1]);
output
Array
(
[0] => foo bar foo bar <item> child item </item>
[1] => second item <item> second child </item>
)
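The pattern above handles exactly one child per item. If items can nest to arbitrary depth, a recursive pattern may be a better fit. This is only a sketch, assuming the tags are always written literally as <item> and </item> with no attributes, and that PCRE recursion via (?R) is available (it is in PHP's preg_* functions). The variable names $pattern and $topLevel are made up for the example.

```php
$text = "<item> foo bar foo bar <item> child item </item> </item> <item> second item <item> second child </item> </item>";

// Inside an item, either consume one character that does not start an
// <item> or </item> tag, or recurse into the whole pattern with (?R) so a
// nested <item>...</item> is swallowed as a single unit.
$pattern = '%<item>((?:(?!</?item>).|(?R))*+)</item>%si';

preg_match_all($pattern, $text, $matches);

// Group 1 holds the content of each outermost item; trim the padding.
$topLevel = array_map('trim', $matches[1]);
print_r($topLevel);
```

On the sample string this should give the same two elements as the output above, and it should keep working when a child item has children of its own.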
Upvotes: 3
Reputation: 8763
Regex is not well suited to what you are doing. If you pursue this route, you will probably spend more time on it than if you just go a different route. I suggest you check out a DOM parser. The one below is fairly easy to use, and should work for your basic needs.
Also check out this question, since it will give you additional alternatives.
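If a DOM parser is not an option for the in-house language, one alternative route is to use a regex only to locate the tags and keep track of the nesting depth yourself. The sketch below assumes the tags are always written literally as <item> and </item>. The function name topLevelItems is hypothetical, not part of any library.

```php
// Hypothetical helper: returns the trimmed content of every top-level
// <item>...</item> by counting how deeply the scanner is nested.
function topLevelItems(string $text): array
{
    $items = [];
    $depth = 0;
    $start = 0;

    // Collect every opening or closing item tag together with its offset.
    preg_match_all('%</?item>%i', $text, $tags, PREG_OFFSET_CAPTURE);

    foreach ($tags[0] as [$tag, $offset]) {
        if ($tag[1] !== '/') {
            // Opening tag: at depth 0 it starts a new top-level item.
            if ($depth === 0) {
                $start = $offset + strlen($tag);
            }
            $depth++;
        } elseif ($depth > 0) {
            // Closing tag: returning to depth 0 ends a top-level item.
            $depth--;
            if ($depth === 0) {
                $items[] = trim(substr($text, $start, $offset - $start));
            }
        }
    }

    return $items;
}

$text = "<item> foo bar foo bar <item> child item </item> </item> <item> second item <item> second child </item> </item>";
print_r(topLevelItems($text));
```

Stray closing tags at depth 0 are ignored and an item that is never closed is dropped, which may or may not be what you want for malformed input.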
Upvotes: 1