Reputation: 54836

How do I grab part of a page's HTML DOM with PHP?

I'm grabbing data from a published Google spreadsheet, and all I want is the information inside the content div (<div id="content">...</div>).

I know that the content starts off as <div id="content"> and ends as </div><div id="footer">

What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure it's that efficient...

header('Content-type: text/plain');

$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');

$start = '<div id="content">';
$end = '<div id="footer">';

$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);

echo $foo;

UPDATE

I guess my other question is whether it's simpler and easier to use a regex with start and end points than to parse a DOM that might contain errors and then extract the piece I need. It seems like regex would be the way to go, but I'd love to hear your opinions.

Upvotes: 0

Answers (3)

Michael Low

Reputation: 24506

Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo); The s modifier makes . match newlines as well. As it stands, your regex only matches if all the content between the tags is on a single line.
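For illustration, here is a minimal sketch of the s-modifier approach. It makes two changes that are my own assumptions rather than part of the original code: it uses preg_match instead of preg_replace (preg_replace would leave the rest of the page around the match in the result), and it runs the markers through preg_quote so their characters are treated literally. The $html string is hypothetical sample markup standing in for the fetched page.

```php
<?php
// Hypothetical sample markup standing in for the fetched spreadsheet page.
$html = "<html><body><div id=\"header\">Header</div><div id=\"content\">\n"
      . "<table><tr><td>A1</td></tr></table>\n"
      . "</div><div id=\"footer\">Footer</div></body></html>";

// Escape the literal markers so they are safe inside the pattern.
$start = preg_quote('<div id="content">', '#');
$end   = preg_quote('</div><div id="footer">', '#');

// The s modifier lets . match newlines; (.*?) captures lazily up to the first end marker.
if (preg_match("#{$start}(.*?){$end}#s", $html, $matches)) {
    $content = trim($matches[1]);
} else {
    $content = ''; // markers not found
}

echo $content, "\n";
```

This only works as long as the page keeps exactly these start and end markers; any change in the markup (extra attributes, whitespace between the closing div and the footer) would make the match fail.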

If your HTML page is any more complex than that, then regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM.
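As a rough sketch of the parser route, the following uses PHP's built-in DOMDocument and DOMXPath to pull out the inner HTML of the content div. The $html string is hypothetical sample markup, and the XPath expression assumes the div really carries id="content" as described in the question.

```php
<?php
// Hypothetical sample markup standing in for the fetched spreadsheet page.
$html = '<html><body><div id="header">Header text</div>'
      . '<div id="content"><p>Row 1</p><p>Row 2</p></div>'
      . '<div id="footer">Footer text</div></body></html>';

// Collect libxml parse warnings internally so imperfect HTML does not emit PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Locate the div by its id attribute.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="content"]')->item(0);

// Serialize each child of the div to get its inner HTML.
$inner = '';
if ($node !== null) {
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner, "\n";
```

Unlike the regex, this does not depend on what follows the closing tag of the div, and the parser tolerates a fair amount of malformed markup, which speaks to the concern about a DOM that might have errors.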

Upvotes: 1