Reputation: 54836
I'm grabbing data from a published Google spreadsheet, and all I want is the information inside the content div (<div id="content">...</div>).
I know that the content starts with <div id="content"> and ends with </div><div id="footer">.
What's the best / most efficient way to grab the part of the DOM that is inside there? I was thinking of a regular expression (see my example below), but it is not working and I'm not sure it is that efficient...
header('Content-type: text/plain');
$foo = file_get_contents('https://docs.google.com/spreadsheet/pub?key=0Ahuij-1M3dgvdG8waTB0UWJDT3NsUEdqNVJTWXJNaFE&single=true&gid=0&output=html&ndplr=1');
$start = '<div id="content">';
$end = '<div id="footer">';
$foo = preg_replace("#$start(.*?)$end#",'$1',$foo);
echo $foo;
UPDATE
I guess another question I have is whether it is simpler and easier to use a regex with start and end points than to parse a DOM that might contain errors and then extract the piece I need. Regex seems like the way to go, but I would love to hear your opinions.
Upvotes: 0
Views: 264
Reputation: 24506
Try changing your regex to $foo = preg_replace("#$start(.*?)$end#s",'$1',$foo); The s modifier makes . match newlines as well. As it is, your regex only matches if all the content between the tags is on a single line.
If your HTML page is any more complex than that, a regex probably won't cut it and you'd need to look into a parser like DOMDocument or Simple HTML DOM.
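A minimal sketch of the regex approach, under a few assumptions: the $html string is made-up sample markup standing in for the downloaded page, and the end marker is the full </div><div id="footer"> sequence described in the question. It uses preg_match rather than preg_replace so that only the captured part is kept, and preg_quote so that characters in the markers are treated literally.

```php
<?php
// Hypothetical sample markup standing in for the downloaded spreadsheet page.
$html = "<html><body><div id=\"header\">Title</div>\n"
      . "<div id=\"content\">\n<table><tr><td>A1</td></tr></table>\n</div>"
      . "<div id=\"footer\">Footer</div></body></html>";

$start = '<div id="content">';
$end   = '</div><div id="footer">';

// preg_quote() escapes regex metacharacters in the markers;
// its second argument also escapes the '#' delimiter.
// The trailing 's' modifier lets '.' match newlines.
$pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($end, '#') . '#s';

// preg_match() returns 1 on a match; $m[1] holds the captured inner content.
$content = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;

echo $content;
```

With preg_replace and '$1', everything before the start marker and after the end marker would stay in the string, so preg_match is the closer fit when only the inner part is wanted. This still depends on the two markers appearing exactly as written.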
Upvotes: 1
Reputation: 43265
Do not use regex; it can fail when the markup changes or contains nested tags. Use PHP's built-in DOM parser: http://php.net/manual/en/class.domdocument.php
You can easily traverse the document and extract the relevant content.
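As a rough sketch of what that could look like, assuming the page has a single element with id="content": the $html string below is made-up sample markup, not the real spreadsheet output, and the XPath lookup is just one possible way to find the node.

```php
<?php
// Hypothetical sample markup standing in for the downloaded spreadsheet page.
$html = '<html><body><div id="header">Title</div>'
      . '<div id="content"><table><tr><td>A1</td></tr></table></div>'
      . '<div id="footer">Footer</div></body></html>';

// Collect libxml parse warnings internally instead of emitting PHP warnings,
// since real-world HTML is often not well-formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find the first div whose id attribute is "content".
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="content"]')->item(0);

// Serialize each child of the div to get its inner HTML.
$inner = '';
if ($node !== null) {
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner;
```

Unlike the start/end marker approach, this does not depend on what follows the content div, so it should keep working if the footer markup changes.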
Upvotes: 0
Reputation: 2481
If you have a lot to do, I would recommend you take a look at http://simplehtmldom.sourceforge.net; it is really good for this sort of thing.
Upvotes: 0