migajek

Reputation: 8614

RegEx problem - retrieve content of tag with given class - preg_match(_all)

I need to retrieve the content of a <p> tag with a given class. The class could be simplecomment or comment ...

So I wrote the following code:

preg_match("|(<p class=\"(simple)?comment(.*)?\">)(.*)<\/p>|ism", $fcon, $desc);

Unfortunately, it returns nothing. However, if I remove the tag-ending part (<\/p>), it works somehow, returning a string which is too long (from the tag start to the end of the document) ...

What is wrong with my regular expression?

Upvotes: 0

Views: 1162

Answers (2)

bjelli

Reputation: 10090

Try using a DOM parser like http://simplehtmldom.sourceforge.net/

If I read the example code on simplehtmldom's homepage correctly you could do something like this:

$desc = $html->find('p.simplecomment', 0)->innertext;
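If you would rather not add a third-party library, roughly the same lookup can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath). This is a swapped-in technique, not simplehtmldom's API. The sample markup is made up, and the XPath matches the class as a whole token ("comment" or "simplecomment"), which is an assumption and slightly stricter than the prefix match in the question's regex.

```php
<?php
// Hypothetical sample input; in the question this would be $fcon.
$fcon = '<div><p class="intro">Hi</p>'
      . '<p class="simplecomment first">Hello <b>world</b></p></div>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($fcon);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select <p> elements whose class list contains the token "comment" or "simplecomment".
$query = '//p[contains(concat(" ", normalize-space(@class), " "), " comment ")'
       . ' or contains(concat(" ", normalize-space(@class), " "), " simplecomment ")]';
$node = $xpath->query($query)->item(0);

$desc = '';
if ($node !== null) {
    // Serialize each child node to rebuild the paragraph's inner HTML.
    foreach ($node->childNodes as $child) {
        $desc .= $doc->saveHTML($child);
    }
}
echo $desc, "\n";
```

If only the text is needed, $node->textContent avoids the serialization loop.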

Upvotes: 2

Blixt

Reputation: 50179

The quick fix here is the following:

'~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is'

Changes:

  • The construct (.*) is greedy and will blindly match as much as it can, which stops your regular expression from working, so I've replaced those instances completely with stricter matches:
    1. ...comment(.*)?... – this will match all or nothing, basically. I replaced this with [^"]* since that will match zero or more non-" characters (basically, it will match up to the closing " character of the class attribute).
    2. ...>)(.*)<\/p>... – again, this will match too much. I've replaced it with an efficient pattern that will match all non-< characters, and once it hits a < it will check whether that < starts </p>. If it does, it will stop matching (since we're at the end of the <p> tag); otherwise it will continue.
  • I changed the delimiter from | to ~, since the pattern now uses | for alternation and an unescaped delimiter character inside the pattern would end it early.
  • I removed the m flag since it has no use in this regular expression (it only changes how ^ and $ behave, and neither appears here).
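A minimal sketch of the quick fix in use, assuming ~ as the pattern delimiter (so that the | inside the pattern is read as alternation) and a made-up sample string in place of the real $fcon:

```php
<?php
// Hypothetical sample input standing in for $fcon from the question.
$fcon = '<p class="intro">Hi</p>' . "\n"
      . '<p class="simplecomment big">First <b>comment</b></p>' . "\n"
      . '<p class="comment">Second</p>';

// "~" is the delimiter, so "|" inside the pattern is plain alternation.
$pattern = '~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is';

if (preg_match($pattern, $fcon, $desc)) {
    // $desc[1] is the opening tag, $desc[2] is "simple" or "", $desc[3] is the content.
    echo $desc[3], "\n";
}

// preg_match_all collects every matching paragraph; index 3 of each set is the content.
preg_match_all($pattern, $fcon, $all, PREG_SET_ORDER);
$contents = array_column($all, 3);
print_r($contents);
```

The content ends up in group 3 only because the pattern keeps the question's extra capture groups; making them non-capturing would move it to group 1.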

But it won't be reliable: given <p class="comment">...<p>...</p></p>, it will stop at the first </p> and match only <p class="comment">...<p>...</p>.

To make it reliable, you'll need to use recursive regular expressions or (even better) an HTML parser (or XML if it's XHTML you're dealing with.) There are even libraries out there that can handle malformed HTML "properly" (like browsers do.)
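As a rough illustration of the recursive option, here is a sketch that compares the quick fix with a pattern using PCRE's (?2) subroutine call to balance nested <p>...</p> pairs. The recursive pattern is my own assumption about how one might write it, not a vetted solution: it ignores HTML comments, attribute values containing >, and unclosed tags, which is why a parser is still the safer choice.

```php
<?php
// Hypothetical input with a nested <p>, mirroring the example above.
$fcon = '<p class="comment">a<p>b</p>c</p>';

// The quick fix: stops at the first </p> it sees.
$quick = '~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is';

// Group 1 is the content. Group 2 matches one balanced <p>...</p> element
// and calls itself through (?2) for deeper nesting.
$recursive = '~<p class="(?:simple)?comment[^"]*">'
           . '((?:[^<]+|<(?!/?p\b)|(<p\b[^>]*>(?:[^<]+|<(?!/?p\b)|(?2))*</p>))*)'
           . '</p>~is';

preg_match($quick, $fcon, $q);
preg_match($recursive, $fcon, $r);

echo $q[3], "\n"; // content up to the first </p>
echo $r[1], "\n"; // content including the nested paragraph
```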

Upvotes: 0
