Nash

Reputation:

php - why does this regex truncate my string to zero length?

Yesterday I tracked down a strange bug which caused a website to display only a white page: no content on it, no visible error message.

I found that a regular expression used in preg_replace was the problem.

I used the regex to replace the title HTML tag in the accumulated content just before echoing the HTML. The HTML got rather large on the page where the bug occurred (60 kB, which is not that large), and it seemed like preg_replace or the regex can only handle a string of a certain length, or my regex is really messed up (also possible).

Look at this sample program which reproduces the problem (tested on PHP 5.2.9):


function replaceTitleTagInHtmlSource($content, $replaceWith) {
  return preg_replace('#(<title>)([\s\S]+)(<\/title>)#i', '$1'.$replaceWith.'$3', $content);
}


$dummyStr = str_repeat('A', 6000);

$totalStr = '<title>foo</title>';

for($i = 0; $i < 10; $i++) {
  $totalStr .= $dummyStr;
}

print 'orignal: ' . strlen($totalStr);
print '<hr />';

$replaced = replaceTitleTagInHtmlSource($totalStr, 'bar');

print 'replaced: ' . strlen($replaced);
print '<hr />';

Output:

orignal: 60018
replaced: 0

So the function gets a string of length 60018 and returns a string of length 0. Not what I wanted my regex to do.


When I change

for($i = 0; $i < 10; $i++) {

to

for($i = 0; $i < 1; $i++) {

to decrease the total string length, the output is:

orignal: 6018
replaced: 6018


When I removed the replacing, the content of the page was displayed without any problems.

Upvotes: 0

Views: 239

Answers (4)

radarek

Reputation: 2648

Backtracking: [\s\S]+ is greedy, so it first matches ALL remaining characters and then steps backwards through the string looking for </title>. [^<]+ matches only characters that aren't <, so it stops right in front of </title> and never has to backtrack over the rest of the document.

function replaceTitleTagInHtmlSource($content, $replaceWith) {
  return preg_replace('#(<title>)([^<]+)(</title>)#i', '$1'.$replaceWith.'$3', $content);
}
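As a quick check, here is the function above run against the same input the question builds (a sketch, only the str_repeat shortcut for building the string is mine). One caveat to keep in mind: [^<]+ needs at least one character and will not match a title that itself contains a < character.

```php
<?php
function replaceTitleTagInHtmlSource($content, $replaceWith) {
  // [^<]+ stops at the first '<', so the engine never scans the 60000 trailing characters.
  return preg_replace('#(<title>)([^<]+)(</title>)#i', '$1'.$replaceWith.'$3', $content);
}

// Same 60018-character input as in the question, built in one call.
$totalStr = '<title>foo</title>' . str_repeat('A', 60000);

$replaced = replaceTitleTagInHtmlSource($totalStr, 'bar');

print 'replaced: ' . strlen($replaced); // prints "replaced: 60018"
```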

Upvotes: 1

pavium

Reputation: 15128

It has been said many times before on SO, e.g. in "Regex to match the first ending HTML tag" (and will probably be mentioned again), that regexes are not appropriate for HTML because tags are too irregular.

Use DOM functions where they're available.
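A minimal sketch of what that could look like with PHP's DOM extension. The function name replaceTitleWithDom is made up for illustration, and note that saveHTML() re-serializes the whole document, so the output may differ from the input in more than the title (a doctype may be added, markup normalized, and non-ASCII content may need an explicit charset declaration).

```php
<?php
// Hypothetical helper: replace the text of the first <title> element via the DOM.
function replaceTitleWithDom($html, $newTitle) {
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $title = $titles->item(0);
        // Remove the old content, then append the new title as a text node
        // so special characters are escaped on output.
        while ($title->firstChild) {
            $title->removeChild($title->firstChild);
        }
        $title->appendChild($doc->createTextNode($newTitle));
    }

    return $doc->saveHTML();
}

$html = '<html><head><title>foo</title></head><body><p>Hi</p></body></html>';
print replaceTitleWithDom($html, 'bar');
```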

Upvotes: 1

Greg

Reputation: 321786

It seems like you're running into the backtracking limit.

This is confirmed if you print preg_last_error(): it returns PREG_BACKTRACK_LIMIT_ERROR.
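The reason the page came out empty is that preg_replace() returns NULL on error, and NULL has length 0 and echoes as nothing. A rough sketch of a wrapper that surfaces the failure instead (replaceTitleOrFail is a hypothetical name, and throwing an exception is just one possible way to react):

```php
<?php
// Hypothetical wrapper: same pattern as in the question, plus an explicit error check.
function replaceTitleOrFail($content, $replaceWith) {
    $result = preg_replace(
        '#(<title>)([\s\S]+)(</title>)#i',
        '${1}' . $replaceWith . '${3}', // ${1} avoids ambiguity if $replaceWith starts with a digit
        $content
    );

    // preg_replace() returns NULL when the PCRE engine gives up.
    if ($result === null) {
        $code = preg_last_error();
        $reason = ($code === PREG_BACKTRACK_LIMIT_ERROR)
            ? 'backtrack limit exhausted'
            : 'PCRE error code ' . $code;
        throw new RuntimeException('Title replacement failed: ' . $reason);
    }

    return $result;
}

print replaceTitleOrFail('<title>foo</title>AAA', 'bar');
```

Whether the large input from the question actually triggers the error depends on the PHP version and the configured limit, so the sketch only shows how to detect it.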

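You can either increase the limit (pcre.backtrack_limit) in your ini file or with ini_set(), or change your regular expression from ([\s\S]+) to (.*?), which will stop it from backtracking so much. Note that . does not match newlines unless you add the s modifier, so add it if the title can span several lines.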
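For the first option, something along these lines should work. The value 1000000 is only an illustrative choice, not a recommendation, and raising the limit only postpones the problem for even larger inputs.

```php
<?php
// Raise the PCRE backtrack limit for this request only (illustrative value).
ini_set('pcre.backtrack_limit', '1000000');

// Same input and pattern as in the question.
$totalStr = '<title>foo</title>' . str_repeat('A', 60000);
$replaced = preg_replace('#(<title>)([\s\S]+)(</title>)#i', '$1bar$3', $totalStr);

if ($replaced === null) {
    // Still over the limit (or another PCRE error occurred).
    print 'still failing, error code: ' . preg_last_error();
} else {
    print 'replaced: ' . strlen($replaced);
}
```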

Upvotes: 2

mauris

Reputation: 43619

Your regex seems to be a little funny.

([\s\S]+) matches every character, whitespace or not, and does so greedily. You should try (.*?) instead.

Changing your function like this works for me:

function replaceTitleTagInHtmlSource($content, $replaceWith) {
  return preg_replace('`\<title\>(.*?)\<\/title\>`i', '<title>'.$replaceWith.'</title>', $content);
}

and the problem seems to be you trying to use $1 and $3 to match and

Upvotes: 0
