Reputation: 267287

How to parse only the first level of nested tags with regex?

Say I have a block of text like this:

<item>
   foo bar foo bar 
   <item> child item </item>
</item>
<item>
   second item
   <item> second child </item>
</item>

Here, I want to parse only the two top-level <item>s and get the result back in an array like this:

[0] = "foo bar foo bar  <item>child item</item>"
[1] = "second item  <item>second child </item>";

However, in my testing, since the child-level <item> tags also match the pattern, they are included too, and I get a 4-element array rather than the 2-element array I want.

This is the pattern I've used:

%<item>(.+)</item>%si

Any ideas?

Edit: This is NOT for HTML; it's for a custom, in-house scripting language for which I can't use any DOM parsers. So please suggest a regex solution.

Upvotes: 3

Answers (3)

Wiktor Stribiżew

Reputation: 627468

You say the input is not HTML, but the string you provided looks like HTML. Regex works best on plain text, not on marked-up text. You do not reveal what language really lies behind this type of input, so the solution I can suggest is based on the assumption that the < char cannot appear as a literal in between element nodes (only as some entity).

That means you might use a negated character class [^<] and apply the * quantifier to it:

%<item>([^<]*)<item>%i

See the regex demo, PHP demo:

$text = "<item> foo bar foo bar <item> child item </item> </item> <item> second item <item> second child </item> </item>";
preg_match_all('%<item>([^<]*)<item>%i', $text, $matches);
print_r($matches[1]);
// => Array ( [0] =>  foo bar foo bar  [1] =>  second item )

Upvotes: 0

keyboardSmasher

Reputation: 2811

%<p>(.+?)^</p>%smi
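
One way to read the pattern above: with the m flag, ^ matches at the start of every line, so only a closing tag that begins a line can end a match. The sketch below tries that idea on the question's multi-line sample. It assumes the tag is <item> rather than <p>, and that top-level closing tags are the only ones that start a line (as in the sample); the variable names are illustrative.

```php
<?php
// Sample input from the question, keeping its line breaks.
$sample = "<item>\n   foo bar foo bar \n   <item> child item </item>\n</item>\n"
        . "<item>\n   second item\n   <item> second child </item>\n</item>\n";

// s: dot also matches newlines; m: ^ matches at each line start; i: case-insensitive.
// Assumption: only top-level closing tags sit at the start of a line.
preg_match_all('%<item>(.+?)^</item>%smi', $sample, $lineAnchored);

// Collapse whitespace runs so the captured content is easy to compare.
$topLevel = array_map(
    function ($s) {
        return trim(preg_replace('/\s+/', ' ', $s));
    },
    $lineAnchored[1]
);

print_r($topLevel);
// [0] => foo bar foo bar <item> child item </item>
// [1] => second item <item> second child </item>
```

This breaks as soon as a child closing tag is placed at the start of a line, so it only fits input with a predictable layout.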

edit

$text = "<item> foo bar foo bar <item> child item </item> </item> <item> second item <item> second child </item> </item>";
preg_match_all('%<item>(.*?<item>.*?</item>).*?</item>%si', $text, $matches);
print_r($matches[1]);

output

Array
(
    [0] =>  foo bar foo bar <item> child item </item>
    [1] =>  second item <item> second child </item>
)
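
The pattern above expects exactly one child inside each top-level item. If the nesting depth can vary, a PCRE recursive pattern is one possible alternative. This is a sketch of a different technique, not part of the answer's pattern, and the function name topLevelItems is made up for illustration.

```php
<?php
/**
 * Return the inner content of each top-level <item> element.
 * Hypothetical helper; assumes tags are written exactly as <item> and </item>.
 */
function topLevelItems(string $text): array
{
    // (?!</?item>). consumes one character that does not start an item tag.
    // (?R) recurses into the whole pattern, so a nested <item>...</item>
    // is swallowed as part of its parent instead of matching on its own.
    $pattern = '%<item>((?:(?!</?item>).|(?R))*)</item>%si';

    preg_match_all($pattern, $text, $matches);

    // Group 1 holds the content between the outermost tags.
    return array_map('trim', $matches[1]);
}

$text = "<item> foo bar foo bar <item> child item </item> </item> "
      . "<item> second item <item> second child </item> </item>";

print_r(topLevelItems($text));
// [0] => foo bar foo bar <item> child item </item>
// [1] => second item <item> second child </item>
```

Unbalanced input simply fails to match at the broken element, and very deep nesting can run into PCRE's backtracking or recursion limits.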

Upvotes: 3

Jordan Mack

Reputation: 8763

Regex is not well suited to what you are doing. If you pursue this route, you will probably spend more time on it than if you take a different one. I suggest you check out a DOM parser. The one below is fairly easy to use and should work for your basic needs.

PHP Simple HTML DOM Parser

Also check out this question, since it will give you additional alternatives.
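
If a DOM parser is not an option (the question's edit says it is not), another route that avoids a single matching pattern is to split the text on the tags and track the nesting depth by hand. The sketch below is an assumption-laden illustration, not something taken from the linked resources: the function name topLevelItemsByDepth is invented, and tags are assumed to appear exactly as <item> and </item>.

```php
<?php
/**
 * Collect the content of each top-level <item> by counting nesting depth.
 * Hypothetical helper for illustration only.
 */
function topLevelItemsByDepth(string $text): array
{
    $items = [];
    $depth = 0;
    $current = '';

    // Split on the tags but keep them in the token stream.
    $tokens = preg_split('%(</?item>)%i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($tokens as $token) {
        $lower = strtolower($token);
        if ($lower === '<item>') {
            if ($depth > 0) {
                $current .= $token;      // nested opening tag is part of the content
            }
            $depth++;
        } elseif ($lower === '</item>') {
            $depth--;
            if ($depth > 0) {
                $current .= $token;      // nested closing tag is part of the content
            } elseif ($depth === 0) {
                $items[] = trim($current); // a top-level element just closed
                $current = '';
            } else {
                $depth = 0;              // stray closing tag: ignore it
            }
        } elseif ($depth > 0) {
            $current .= $token;          // text inside an element
        }
    }

    return $items;
}

print_r(topLevelItemsByDepth(
    "<item> foo bar foo bar <item> child item </item> </item> "
    . "<item> second item <item> second child </item> </item>"
));
// [0] => foo bar foo bar <item> child item </item>
// [1] => second item <item> second child </item>
```

An element that is never closed is silently dropped here; real input may need an explicit error for that case.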

Upvotes: 1
