Dave

Reputation: 1

PHP regex, incorrect escaping of html causing problem

I'm trying to use (.+?) to isolate the words "I. NEED. ISOLATION" in the source below:

<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      I. NEED. ISOLATION  </font> </td>

using (.+?), I could do this:

$regex = '/stuff before(.+?)stuff after/';

and for this html, that would be:

$regex = '/<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      (.+?)  </font> </td>/';

but it's choking on it because of incorrect escaping. I'm not great at PHP. Can someone please advise which characters I should escape for HTML that looks like this?

<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      I. NEED. ISOLATION  </font> </td>

Note that I'm not trying to design a regex pattern. I already have the pattern nailed down with (.+?); I just need to know how to correctly escape the HTML so that PHP doesn't choke on it.

Upvotes: 0

Views: 119

Answers (6)

Alan Moore

Reputation: 75232

As a matter of fact, there's nothing in that string that has special meaning in a regex (except the (.+?), of course). The only reason the / is causing a problem is because you're using it as the regex delimiter. You just need to choose a different delimiter, like ~ for example:

$regex = '~<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      (.+?)  </font> </td>~';

Upvotes: 0

Kamil Szot

Reputation: 17817

There is a function that does this for you. It's named preg_quote: http://pl2.php.net/preg_quote. Pass the delimiter as the second argument so that the / characters are escaped as well.

$regex = '/'.preg_quote('<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      ', '/').'(.+?)'.preg_quote('  </font> </td>', '/').'/';

You should also be careful with case sensitivity and line breaks. I often add the i flag (case-insensitive matching) and the s flag (the dot also matches newlines) to my regexes to deal with them, so they look like /(.+?)/is
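As a rough sketch of how preg_quote and the flags fit together (the variable names and the choice of \s* around the capture group are illustrative assumptions, not part of the original answer), the literal parts can be quoted separately and the whitespace matched loosely:

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2">
      I. NEED. ISOLATION  </font> </td>
HTML;

// Literal text on either side of the capture group (hypothetical names).
$before = '<font face="Arial" size="2">';
$after  = '</font>';

// The second argument tells preg_quote() to escape the "/" delimiter too.
// \s* absorbs the spaces and line breaks around the wanted text.
$regex = '/' . preg_quote($before, '/') . '\s*(.+?)\s*' . preg_quote($after, '/') . '/is';

$found = null;
if (preg_match($regex, $html, $m)) {
    $found = $m[1];
}
echo $found, "\n";
```

Matching whitespace with \s* instead of copying it literally means the pattern keeps working if the indentation of the page changes.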

Upvotes: 0

ghostdog74

Reputation: 342393

$str=<<<EOF
<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2">
      I. NEED. ISOLATION  </font> </td>
EOF;

$s = explode("</font>",$str);
foreach($s as $k=>$v){
    if(strpos($v,'<font face="Arial" size="2">') !== false){
        $t=explode('<font face="Arial" size="2">',$v);
        print trim($t[1])."\n";
    }
}

output

$ php test.php
I. NEED. ISOLATION

Upvotes: 0

Pascal MARTIN

Reputation: 401022

First of all, you should really not use regular expressions to try to "parse" HTML -- which is not quite regular.

Going with something like DOMDocument::loadHTML and some XPath query is generally a much better solution.
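A minimal sketch of that approach follows. It assumes a fuller, well-formed version of the fragment (the question only shows the tail of the label cell), and the XPath expression is just one plausible way to locate the value cell:

```php
<?php
// Hypothetical fuller fragment; the surrounding table markup is assumed.
$html = <<<HTML
<table><tr>
    <td valign="top"><font face="Arial" size="2"><strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2">
      I. NEED. ISOLATION  </font> </td>
</tr></table>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them for sloppy markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find the cell whose <strong> reads "Label:", then take the next sibling cell.
$cells = $xpath->query('//td[.//strong[normalize-space()="Label:"]]/following-sibling::td[1]');

$value = $cells->length > 0 ? trim($cells->item(0)->textContent) : null;
echo $value, "\n";
```

Because the query keys on the label text rather than on exact whitespace, it should tolerate changes in indentation and attribute order that would break a literal regex.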


But if you really want to go with a regex (and it seems you do, judging from your comments on other answers), I suppose you should not use / as the regex delimiter: there are too many slashes in HTML already, and it will be escaping hell, as you already noticed.

For instance, you could use # as the regex delimiter:

$str = <<<STR
<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      I. NEED. ISOLATION  </font> </td>
STR;
$regex = '#<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      (.+?)  </font> </td>#';
if (preg_match($regex, $str, $m)) {
  var_dump($m[1]);
}

This will get you:

string 'I. NEED. ISOLATION' (length=18)

Note the only thing I changed compared to your proposed code is the regex delimiter ;-)


And by using a character that's not present in the HTML string, I don't have anything to escape. In particular, I don't have to escape all the / characters, which makes the regex much easier to write, read, and understand.

Upvotes: 2

Gumbo

Reputation: 655269

If you’re using PCRE regular expressions, you need to escape the delimiters inside the regular expression (in your case the /):

'/<strong>Label:<\/strong><\/font><\/td>
<td valign="top" width="82%"> <font face="Arial" size="2"> 
  (.+?)  <\/font> <\/td>/'

But probably more important: Regular expressions are not suitable for parsing HTML. Better use a proper HTML parser like the one provided by DOMDocument and query it with DOMXPath.

Upvotes: 0

Amber

Reputation: 526643

See this previous StackOverflow question.

That said, the escaping issue is due to the / characters within the pattern, which confuse the regex parser because you're already using / to delimit the regex.
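The effect can be reproduced on a much smaller input. This is only an illustrative sketch; the sample string and variable names are made up for the demonstration:

```php
<?php
$subject = '<b>text</b>';

// With "/" as the delimiter, the pattern ends at the first unescaped "/".
// Everything after it ("b>/") is read as modifiers, so compilation fails.
// The @ only silences the resulting warning for this demonstration.
$broken = @preg_match('/<b>(.+?)</b>/', $subject, $m1);

// Escaping the inner slash keeps "/" usable as the delimiter.
$escaped = preg_match('/<b>(.+?)<\/b>/', $subject, $m2);

var_dump($broken);   // false: the pattern did not compile
var_dump($m2[1]);    // the captured text between the tags
```

Checking the return value with === false is a cheap way to tell a pattern that failed to compile apart from one that simply did not match, which returns 0.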

Upvotes: 3
