acidtv

Reputation: 170

PHP regex backreference

I'm trying to match the attributes of an HTML tag, but I can't get it working :)

Let's take this tag for example:

<a href="ddd" class='sw ' w'>

Obviously the last part is not quite right.

Now I tried to match the attributes part with this piece of code:

preg_match('/(\s+\w+=(?P<quote>(\'|\"))[^(?P=quote)]*(?P=quote))*/U', " href=\"bla\" class='sw'sw'", $a);

Here $a is empty, which is what I expect. But when I use my complete expression, it does match the last class part, which puzzles me. It looks like this:

preg_match('/<(?P<c>[\/]?)(?P<tag>\w+)(?P<atts>(\s+\w+=(?P<quote>(\'|\"))[^(?P=quote)]*(?P=quote))*)\s*(?P<sc>[\/]?)>/U', $tag, $a);

Now $a returns:

Array
(
[0] => <a href="ddd" class='sw ' w'>
[c] => 
[1] => 
[tag] => a
[2] => a
[atts] =>  href="ddd" class='sw ' w'
[3] =>  href="ddd" class='sw ' w'
[4] =>  class='sw ' w'
[quote] => '
[5] => '
[6] => '
[sc] => 
[7] => 
)

Notice key 4, which contains the class part including the trailing w', even though I used the U (ungreedy) modifier at the end.

Any clues?

Upvotes: 0

Views: 1561

Answers (2)

bobince

Reputation: 536429

[^(?P=quote)]

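You can't do that. Character classes only contain single characters, backslash-escapes and - ranges; this one is a negated set of the literal characters (, ), ?, P, = and so on, so it matches any other character, quotes included. That is why the value runs straight through the stray ' in 'sw ' w'. The U modifier only makes the shorter match be tried first; when that fails, the engine still extends it.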
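A quick way to see this is to test the class on its own. The snippet below is only an illustrative sketch; the variable names are made up for the example.

```php
<?php
// The class is a negated set of the literal characters ( ? P = q u o t e )
$class = '/^[^(?P=quote)]+$/';

// Quote characters are NOT in the excluded set, so this string passes.
$quotesPass = preg_match($class, "'\"");

// The letters q, u, o, t, e ARE in the excluded set, so this string fails.
$lettersFail = preg_match($class, 'quote');

var_dump($quotesPass, $lettersFail);
```

So the class does nothing to stop the value at the matching quote.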

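Moreover, inside the brackets (?P=quote) is not a backreference at all. Outside a character class, PCRE does treat (?P=quote) as a named backreference to whatever the group captured, which is why the closing-quote part of your pattern works. Don't confuse it with (?P>quote), which is a subroutine call that re-runs the regex from the earlier definition:

(?P<quote>(\'|\"))

and so would match either ' or " regardless of which quote was used at the start of the attribute value. Numbered backrefs are done with expressions like \1 matching the numbered () match group.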
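If you still want "anything except the opening quote", one common workaround is a negative lookahead on the backreference instead of a character class. This is a minimal sketch for a single attribute list, not a general HTML matcher; the pattern and sample input are assumptions for illustration.

```php
<?php
// Group 1: attribute name, group 2: opening quote, group 3: value.
// (?:(?!\2).)* consumes one character at a time as long as the next
// character is not the quote captured in group 2.
$pattern = '/\s+(\w+)=([\'"])((?:(?!\2).)*)\2/';

$input = " href=\"bla\" class='sw \"x\"'";
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[3], "\n";
}
```

On a broken value like class='sw ' w' this stops at the first closing ', so the value is "sw " and the trailing w' is left unmatched.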

But anyway, squeeks is right: parsing [X][HT]ML with regex is a total losing game. You will never come up with an expression that treats all possible markup correctly. Stop wasting your time and use an XML or HTML parser.

Upvotes: 0

squeeks

Reputation: 1269

It's really a bad idea to try and regex HTML - PHP's DOM extension (DOMDocument) can parse the markup and hand you the attributes.
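A rough sketch of that approach, assuming the DOM extension is enabled and using a well-formed sample tag (how libxml recovers from a broken tag like the one in the question is not shown here):

```php
<?php
$html = '<a href="ddd" class="sw">link</a>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Gather every attribute of every <a> element as name => value.
$attrs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    foreach ($a->attributes as $attr) {
        $attrs[$attr->name] = $attr->value;
    }
}

print_r($attrs);
```

The parser handles quoting and escaping itself, so there is no quote-matching logic to get wrong.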

Upvotes: 1
