user1928523

Reputation: 29

Why would this preg_match_all suddenly stop working?

This code was working for days until it stopped working at the worst possible time. It simply pulls weather alert information from a NOAA website and displays it on my page. Can someone please tell me why this would suddenly fail?

$file = file_get_contents("http://forecast.weather.gov/showsigwx.php?warnzone=ARZ018&warncounty=ARC055");  
preg_match_all('#<div id="content">([^`]*?)<\/div>#', $file, $matches); 
$content = $matches[1];  

echo "content = ".$content."</br>" ;
echo "matches = ".$matches."</br>" ;
print_r ($matches); echo "</br>";
echo "file </br>".$file."</br></br>" ;

Now all I get is an empty array.

This is the output:

content = Array
matches = Array
Array ( [0] => Array ( ) [1] => Array ( ) )
file = the full page as requested by file_get_contents

Upvotes: 1

Views: 235

Answers (1)

Ilmari Karonen

Reputation: 50368

Your regexp is trying to match the literal string <div id="content">, followed by zero or more (as few as possible) characters that are not backticks (`), followed by the literal string </div>.

However, in the current set of NOAA warnings and advisories, there is a backtick between <div id="content"> and </div>:

A SLIGHT RISK FOR SEVERE THUNDERSTORMS IS IN EFFECT FOR NORTHEAST
MISSISSIPPI SOUTH OF A CALHOUN CITY TO FULTON MISSISSIPPI LINE
FROM LATE THIS AFTERNOON THROUGH THIS EVENING. DAMAGING WINDS
WILL BE THE MAIN THREAT...HOWEVER AN ISOLATED TORNADO CAN`T BE
RULED OUT.

That's why your regexp doesn't match.

The simplest "fix" would be to replace the regexp with, say:

'#<div id="content">(.*?)<\/div>#s'

where . will, with the s modifier, match any character.
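To see the difference in isolation, here is a minimal sketch that runs both patterns against a made-up HTML string instead of the live NOAA page. The sample markup and the variable names $old and $new are assumptions for illustration only.

```php
<?php
// Hypothetical sample standing in for the NOAA page; note the backtick in CAN`T.
$file = "<div id=\"content\">AN ISOLATED TORNADO CAN`T BE\nRULED OUT.</div>";

// Original pattern: [^`] cannot match the backtick, so nothing is captured.
preg_match_all('#<div id="content">([^`]*?)</div>#', $file, $old);

// Replacement pattern: with the s modifier, . also matches newlines.
preg_match_all('#<div id="content">(.*?)</div>#s', $file, $new);

var_dump(count($old[1]));  // number of captures from the original pattern
echo $new[1][0], "\n";     // first captured text from the replacement pattern
```

Note that preg_match_all fills $matches[1] with an array of captures, so you need to index into it (for example $matches[1][0]) rather than echo it directly, which only prints "Array".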

However, what you really should do is use a proper HTML parser to extract the text, instead of trying to parse HTML with regexps.


Edit: Here's a quick example (untested!) of how you could do this with DOMDocument:

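$url = "http://forecast.weather.gov/showsigwx.php?warnzone=ARZ018&warncounty=ARC055";
$html = file_get_contents( $url );
$doc = new DOMDocument();
libxml_use_internal_errors( true );  // real-world HTML is rarely valid; keep parser warnings quiet
$doc->loadHTML( $html );
$node = $doc->getElementById( 'content' );
$content = $node ? $node->textContent : '';  // getElementById returns null if the id is absent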

or even just:

$doc = new DOMDocument();
$doc->loadHTMLFile( $url );
$content = $doc->getElementById( 'content' )->textContent;

Upvotes: 6
