Reputation: 13800
Consider a document in the following format:
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
I am loading a document like this from one domain to another with PHP cURL. I would like to trim my cURL result to only include div.blog_post_item.first and its children. I know the structure of the other page, yet I can't edit it. I imagine I can use preg_match to find the opening and closing tags; they will always look the same, including that ending comment.
I have searched for examples/tutorials of screen scraping with cURL/XPath/XSLT/whatever, and it's mostly a cyclical rattling off of names of HTML parsing libraries. For that reason, please provide a simple working example. Please do not simply explain that parsing HTML with regex is a potential security vulnerability. Please do not just list libraries and specifications that I should read further into.
I have some simple PHP cURL code:
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_HEADER, 0);
$output = curl_exec($ch);
curl_close($ch);
Of course, now $output contains the entire source. How will I get just the contents of that element?
Upvotes: 0
Views: 4959
Reputation: 13843
That's quite easy if you are sure the beginning and end are ALWAYS the same. All you have to do is search for the beginning and the end and match everything in between. I think a lot of people will be pissed at me for using regex to find a bit of HTML, but it'll do the job!
// cURL
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches); // all matches
Because I don't know which website you're trying to crawl, I'm not sure whether this works.
After searching for quite a while (26 minutes to be exact) I found out why it didn't work. By default, the dot (.) doesn't match newlines. Because HTML is full of newlines, the pattern couldn't match the contents. Using a slightly dirty hack I managed to get it matching anyway (even though you already picked an answer).
// cURL
$ch = curl_init('http://blogg.oscarclothilde.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(([^.]|.)*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches[1][0]); // contents of the first matching div
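As an alternative to the character-class hack, PCRE's s modifier (PCRE_DOTALL) makes the dot match newlines too. The following is a minimal sketch, not tested against the real site: the $output string is a made-up stand-in for the cURL response, and the # delimiters are just a choice that avoids escaping the slash.

```php
<?php
// Hypothetical sample standing in for the cURL response.
$output = <<<'HTML'
<body>
<div class="blog_post_item first">
<p>Hello</p>
</div><!-- end blog_post_item -->
</body>
HTML;

// The trailing "s" lets "." match newlines; "#" delimiters avoid escaping "/".
$pattern = '#<div class="blog_post_item first">(.*?)</div><!-- end blog_post_item -->#s';

$inner = null;
if (preg_match($pattern, $output, $m) === 1) {
    // $m[1] holds everything between the opening tag and the end marker.
    $inner = trim($m[1]);
}

echo $inner === null ? 'Element not found' : $inner;
```

preg_match is used here instead of preg_match_all because only one such div is expected; checking its return value distinguishes "no match" from an empty element.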
Upvotes: 3
Reputation: 2408
If you are sure about the following structure:
<div class="blog_post_item first">
WHATEVER
</div><!-- end blog_post_item -->
AND you are sure the ending-code doesn't appear in WHATEVER, then you can simply grab it.
(Please note that I replaced your original PHP with WHATEVER. cURL will only fetch the HTML, which will contain the rendered content, not PHP.)
You don't need a regex. You can also do it simply by searching for the wanted strings, like in my example below.
$curlResponse = '
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>';
$startStr = '<div class="blog_post_item first">';
$endStr = '</div><!-- end blog_post_item -->';
$startStrPos = strpos($curlResponse, $startStr)+strlen($startStr);
$endStrPos = strpos($curlResponse, $endStr);
$wanted = substr($curlResponse, $startStrPos, $endStrPos-$startStrPos );
echo htmlentities($wanted);
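One caveat with the strpos approach: strpos returns false when a marker is missing, and false silently becomes 0 in the arithmetic. A slightly more defensive sketch is shown here, assuming the same two markers; extractBetween() is a hypothetical helper name and $html is a made-up stand-in for the cURL response.

```php
<?php
/**
 * Return the text between $start and the next $end,
 * or null if either marker is missing.
 */
function extractBetween($haystack, $start, $end)
{
    $from = strpos($haystack, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);

    // Look for the end marker only after the start marker.
    $to = strpos($haystack, $end, $from);
    if ($to === false) {
        return null;
    }

    return substr($haystack, $from, $to - $from);
}

// Hypothetical sample standing in for the cURL response.
$html = "<body>\n<div class=\"blog_post_item first\">\n<p>Hello</p>\n"
      . "</div><!-- end blog_post_item -->\n</body>";

$wanted = extractBetween(
    $html,
    '<div class="blog_post_item first">',
    '</div><!-- end blog_post_item -->'
);

echo $wanted === null ? 'Element not found' : htmlentities(trim($wanted));
```

Passing $from as the offset to the second strpos call also guards against an end marker that happens to appear earlier in the page than the start marker.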
Upvotes: 2
Reputation: 173542
This piece of code should work (requires PHP >= 5.3.6 and the dom extension):
$s = <<<EOM
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
EOM;
$d = new DOMDocument;
$d->loadHTML($s);
$x = new DOMXPath($d);
foreach ($x->query('//div[contains(@class, "blog_post_item") and contains(@class, "first")]') as $el) {
    echo $d->saveHTML($el);
}
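Two details may matter on a real page: loadHTML emits warnings for invalid markup, and contains(@class, "blog_post_item") also matches longer class names such as blog_post_item_footer. The sketch below is a possible variation that addresses both; the sample markup, including the blog_post_item_footer div, is invented for illustration.

```php
<?php
// Hypothetical sample standing in for the cURL response.
$html = '<!DOCTYPE html><html><head><title></title></head><body>'
      . '<div class="blog_post_item first"><p>Hello</p></div><!-- end blog_post_item -->'
      . '<div class="blog_post_item_footer first">ignored</div>'
      . '</body></html>';

// Collect parser warnings instead of printing them; real pages are rarely valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Pad the class attribute with spaces so only whole class names match.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " blog_post_item ")'
       . ' and contains(concat(" ", normalize-space(@class), " "), " first ")]';

$found = array();
foreach ($xpath->query($query) as $el) {
    // saveHTML($node) serialises just that element and its children.
    $found[] = $doc->saveHTML($el);
}

echo implode("\n", $found);
```

With the cURL code from the question, $html would be the string returned by curl_exec once CURLOPT_RETURNTRANSFER is set.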
Upvotes: 2