Reputation: 8614
I need to retrieve the content of a <p> tag with a given class. The class could be simplecomment or comment ...
So I wrote the following code:
preg_match("|(<p class=\"(simple)?comment(.*)?\">)(.*)<\/p>|ism", $fcon, $desc);
Unfortunately, it returns nothing. However, if I remove the tag-ending part (<\/p>), it works somehow, returning a string that is too long (from the start of the tag to the end of the document) ...
What is wrong with my regular expression?
Upvotes: 0
Views: 1162
Reputation: 10090
Try using a dom parser like http://simplehtmldom.sourceforge.net/
If I read the example code on simplehtmldom's homepage correctly, you could do something like this to read the content of the first matching <p>:
$text = $html->find('p.simplecomment', 0)->innertext;
Upvotes: 2
Reputation: 50179
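The quick fix here is the following (note the ~ delimiter: the pattern now contains an alternation |, so | can no longer serve as the delimiter, or PHP would end the pattern at the first | it finds):
'~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is'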
Changes:
(.*) will just blindly match everything, which stops your regular expression from working, so I've replaced those instances with stricter matches:
comment(.*)? : this will match all or nothing, basically. I replaced it with [^"]* since that will match zero or more non-" characters (basically, it will match up to the closing " character of the class attribute).
>)(.*)<\/p> : again, this will match too much. I've replaced it with a pattern that will match all non-< characters, and once it hits a < it will check whether it is followed by </p>. If it is, it will stop matching (since we're at the end of the <p> tag); otherwise it will continue.
I removed the m flag since it has no use in this regular expression (it only changes how ^ and $ behave, and the pattern uses neither).
But it won't be reliable: imagine <p class="comment">...<p>...</p></p>; it will match only <p class="comment">...<p>...</p>.
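To see both the normal case and the nested-tag failure described above, here is a minimal sketch. The sample strings are made up for illustration; $fcon and $desc just mirror the variable names from the question. It uses ~ as the delimiter because the pattern itself contains |.

```php
<?php
// Hypothetical sample input; $fcon mirrors the variable name from the question.
$fcon = '<div><p class="simplecomment first">Hello <b>world</b></p>'
      . '<p class="other">skip</p></div>';

// "~" is the delimiter here because the pattern contains an alternation "|".
$pattern = '~(<p class="(simple)?comment[^"]*">)((?:[^<]+|(?!</p>).)*)</p>~is';

if (preg_match($pattern, $fcon, $desc)) {
    // $desc[1] is the opening tag, $desc[2] is "simple" or "",
    // and $desc[3] is everything up to the first </p>.
    echo $desc[3], "\n"; // prints: Hello <b>world</b>
}

// Nested <p> case: the match stops at the first </p>, not the outer one.
$nested = '<p class="comment">outer <p>inner</p> tail</p>';
preg_match($pattern, $nested, $m);
echo $m[3], "\n"; // prints: outer <p>inner
```

The second output shows why the pattern is only a quick fix: the inner </p> ends the match, so " tail" is lost.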
To make it reliable, you'll need to use recursive regular expressions or (even better) an HTML parser (or XML if it's XHTML you're dealing with.) There are even libraries out there that can handle malformed HTML "properly" (like browsers do.)
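As a rough sketch of the parser route, PHP's bundled DOM extension (DOMDocument plus DOMXPath) could be used along these lines. The sample markup is invented, and the XPath matches the class as a whole token ("comment" or "simplecomment"), which is an assumption about what you want; it is slightly stricter than the regular expression, which accepts any class value starting with one of those words.

```php
<?php
// Hypothetical sample input standing in for $fcon from the question.
$fcon = '<div><p class="simplecomment">First <b>note</b></p>'
      . '<p class="intro">skip</p>'
      . '<p class="comment big">Second</p></div>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($fcon);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select <p> elements whose class list contains the token "comment" or "simplecomment".
$query = '//p[contains(concat(" ", normalize-space(@class), " "), " comment ")'
       . ' or contains(concat(" ", normalize-space(@class), " "), " simplecomment ")]';

$comments = [];
foreach ($xpath->query($query) as $p) {
    // textContent returns the text only, with nested tags stripped.
    $comments[] = $p->textContent;
}

print_r($comments); // "First note" and "Second"
```

If you need the inner markup rather than plain text, one option is to call $doc->saveHTML($child) on each child node of the matched element and concatenate the results.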
Upvotes: 0