Reputation: 47
I would like to extract the contents of certain parts of a web page using selectors, and I am using Simple HTML DOM to do this. However, more data is returned than is present in the elements I select. I have checked the Simple HTML DOM FAQ and searched Stack Overflow, but did not find anything that could help me out.
I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/
In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.
How can I limit the output to only include the data contained within the h2 tag?
Here is the code I am using:
<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');
$target_url = "http://www.theatlantic.com/most-popular/";
$html = new simple_html_dom();
$html->load_file($target_url);
$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
$post = $posts[$i];
$post->find('h2[class=hed]',0)->outertext = "";
echo strip_tags($post, '<p><a>');
}
?>
</div>
Output can be seen here. Instead of only a couple of article links, I also get author information and article summaries, among other things.
Upvotes: 0
Views: 191
Reputation: 54984
You really only need one loop. Consider this:
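foreach ($html->find('ul.river h2.hed') as $postNum => $h2) { // descendant selector: the h2 is nested deeper than a direct child of the ul
    if ($postNum >= 10) break; // stop after ten headlines
    echo strip_tags($h2, '<p><a>') . "\n"; // the text
    echo $h2->parent->href . "\n"; // the href, assuming the h2's parent is the link
}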
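For comparison, here is a minimal sketch of the same single-loop idea using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The HTML fragment is invented for illustration and is only an assumption about how the page is structured; `classTest` and `extractHeadlines` are hypothetical helper names, not part of any library.

```php
<?php
// Invented fragment standing in for the real page (an assumption, not the actual markup).
$sample = '<ul class="river">'
        . '<li><h2 class="hed"><a href="/a/first/">First headline</a></h2>'
        . '<p class="dek has-dek">Summary that should be skipped.</p></li>'
        . '<li><h2 class="hed"><a href="/a/second/">Second headline</a></h2>'
        . '<p class="dek has-dek">Another summary.</p></li>'
        . '</ul>';

// Hypothetical helper: XPath test for "element has this class among its classes".
function classTest($class)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' " . $class . " ')";
}

// Hypothetical helper: returns up to $limit headline text/href pairs.
function extractHeadlines($html, $limit = 10)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Every h2.hed that is a descendant of ul.river, in document order.
    $h2s = $xpath->query('//ul[' . classTest('river') . ']//h2[' . classTest('hed') . ']');

    $result = array();
    foreach ($h2s as $h2) {
        if (count($result) >= $limit) {
            break; // limit reached
        }
        // Accept a link that wraps the h2 or one nested inside it.
        $link = $xpath->query('ancestor::a | .//a', $h2)->item(0);
        $result[] = array(
            'text' => trim($h2->textContent),
            'href' => $link ? $link->getAttribute('href') : null,
        );
    }
    return $result;
}

$headlines = extractHeadlines($sample);
foreach ($headlines as $item) {
    echo $item['text'], ' => ', $item['href'], "\n";
}
```

Because the query starts at the h2 elements themselves, the `p class="dek has-dek"` siblings never enter the result, so no tag stripping is needed afterwards.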
Upvotes: 0
Reputation: 350310
You are not outputting the h2 contents, but the ul contents in the echo:
echo strip_tags($post, '<p><a>');
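Note that the statement before the echo does not narrow $post down to the h2. Assigning an empty string to outertext blanks out the first h2, so it is the one part that is left out when $post is echoed: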
$post->find('h2[class=hed]',0)->outertext = "";
Change the code to this:
$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');
However, that will only do something with the first found h2. So you need another loop. Here is a rewrite of the code after load_file:
$posts = $html->find('ul[class=river]');
foreach ($posts as $postNum => $post) {
    if ($postNum >= 10) break; // limit reached
    $heds = $post->find('h2[class=hed]');
    foreach ($heds as $hed) {
        echo strip_tags($hed, '<p><a>');
    }
}
If you still need to clear outertext, you can do it with $hed:
$hed->outertext = "";
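The difference between echoing the ul and echoing the h2 can be reproduced without fetching the page. The sketch below uses PHP's built-in DOMDocument rather than Simple HTML DOM, and the one-item HTML fragment is made up for illustration, so treat it as a demonstration of the principle and not of the actual site markup.

```php
<?php
// Invented markup for illustration only (an assumption about the page structure).
$sample = '<ul class="river"><li>'
        . '<h2 class="hed"><a href="/a/first/">First headline</a></h2>'
        . '<p class="dek has-dek">Summary text.</p>'
        . '</li></ul>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // silence warnings on imperfect markup
$doc->loadHTML($sample);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$ul = $doc->getElementsByTagName('ul')->item(0);
$h2 = $ul->getElementsByTagName('h2')->item(0);

// Serializing the container keeps every descendant, so the summary paragraph survives strip_tags.
$fromUl = strip_tags($doc->saveHTML($ul), '<p><a>');

// Serializing only the heading leaves just the link and its text.
$fromH2 = strip_tags($doc->saveHTML($h2), '<p><a>');

echo $fromUl, "\n";
echo $fromH2, "\n";
```

The first echo still contains the `p class="dek has-dek"` text because `<p>` is in the allowed list and the paragraph lives inside the ul; the second cannot contain it because the paragraph is a sibling of the h2, not a descendant.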
Upvotes: 1