Simple HTML Dom Crawler returns more than contained in attributes

Question

I would like to extract the contents of certain parts of a website using selectors, and I am using Simple HTML DOM to do this. However, for some reason more data is returned than is present in the elements I select. I have checked the Simple HTML DOM FAQ but did not see anything that could help, and I wasn't able to find anything on Stack Overflow either.

I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/

In my output I am receiving a lot of data from other tags like p class="dek has-dek" that are not contained within the h2 tag and should not be included. This is really strange as I thought the code would only allow for content within those tags to be scraped.

How can I limit the output to only include the data contained within the h2 tag?

Here is the code I am using:


<?php
// Assumes simple_html_dom.php is included and $target_url holds the page URL.
$html = new simple_html_dom();
$html->load_file($target_url);

$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
  $post = $posts[$i];
  $post->find('h2[class=hed]',0)->outertext = "";
  echo strip_tags($post, '');
}
?>

Output can be seen here. Instead of only a handful of article links, I also get author information, article details, and more.

trincot · Accepted Answer

You are not outputting the h2 contents, but the ul contents in the echo:

echo strip_tags($post, '');

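Note that the statement before the echo does not restrict $post to the h2. It sets the outertext of the first matching h2 to an empty string, which if anything drops that heading from what $post renders: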

$post->find('h2[class=hed]',0)->outertext = "";

Change code to this:

$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '');

However, that only handles the first h2 that is found. To process all of them you need a second, inner loop. Here is a rewrite of the code after load_file:

$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
    if ($postNum >= 10) break; // limit reached
    $heds = $post->find('h2[class=hed]');
    foreach($heds as $hed) {
        echo strip_tags($hed, '');
    }
}

If you still need to clear outertext, you can do it with $hed:

$hed->outertext = "";
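
Since the question also asks for the hrefs, the same nested loop can collect each headline's text and link. This is only a sketch: the sample markup is made up, and it assumes each h2 class="hed" wraps an a tag, which may not match the real page. $sample and $articles are illustrative names; str_get_html, find, plaintext and href come from Simple HTML DOM.

```php
<?php
// Assumption: simple_html_dom.php (the Simple HTML DOM library) sits next to this script.
require_once 'simple_html_dom.php';

// Hypothetical sample markup standing in for the real page.
$sample = '
<ul class="river">
  <li>
    <h2 class="hed"><a href="/a/first/">First headline</a></h2>
    <p class="dek has-dek">Teaser that should not be printed.</p>
  </li>
  <li>
    <h2 class="hed"><a href="/a/second/">Second headline</a></h2>
    <p class="dek has-dek">Another teaser.</p>
  </li>
</ul>';

$html = str_get_html($sample);

$articles = array();
foreach ($html->find('ul[class=river]') as $river) {
    // Search inside the ul only, and keep just the h2 elements.
    foreach ($river->find('h2[class=hed]') as $hed) {
        $link = $hed->find('a', 0); // null when the h2 holds no link
        $articles[] = array(
            'title' => trim($hed->plaintext),
            'href'  => $link ? $link->href : null,
        );
    }
}

// Print one "title => href" line per headline.
foreach ($articles as $article) {
    echo $article['title'], ' => ', $article['href'], "\n";
}
```

Collecting into an array first keeps the scraping separate from the output, so the 10-item limit from the question could be applied afterwards with array_slice($articles, 0, 10).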
