Jasper

Reputation: 47

Simple HTML Dom Crawler returns more than contained in attributes

I would like to extract the contents of certain parts of a website using selectors, and I am using Simple HTML DOM to do this. However, more data is returned than is present in the elements matched by the selectors I specify. I have checked the Simple HTML DOM FAQ but did not see anything that could help, and I wasn't able to find anything on Stack Overflow either.

I am trying to get the contents/hrefs of all h2 class="hed" tags contained within the ul class="river" on this webpage: http://www.theatlantic.com/most-popular/

My output contains a lot of data from other tags, such as p class="dek has-dek", that are not inside the h2 tags and should not be included. This is strange, as I expected the code to return only the content within those h2 tags.

How can I limit the output to only include the data contained within the h2 tag?

Here is the code I am using:

<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.theatlantic.com/most-popular/";

$html = new simple_html_dom();

$html->load_file($target_url);

$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for($i=0; $i < $limit; $i++){
  $post = $posts[$i];
  $post->find('h2[class=hed]',0)->outertext = "";
  echo strip_tags($post, '<p><a>');
}
?>
</div>

Output can be seen here. Instead of only a handful of article links, I also get the author, information about the article, and other content.

Upvotes: 0

Views: 191

Answers (2)

pguardiario

Reputation: 54984

You really only need one loop. Consider this:

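// Descendant selector: valid HTML allows only li as direct children of a ul,
// so on well-formed markup the child form 'ul.river > h2.hed' would match nothing.
foreach($html->find('ul.river h2.hed') as $postNum => $h2) {
  if ($postNum >= 10) break; // stop after ten headings
  echo strip_tags($h2, '<p><a>') . "\n"; // the text
  echo $h2->parent->href . "\n"; // the href, assuming the h2 sits directly inside the a
}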
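
Whether a child or a descendant step is the right one depends on how deeply the h2 elements are nested. The following is only a rough sketch of that difference, using PHP's built-in DOM extension and XPath rather than Simple HTML DOM, and run against made-up markup (the `$sample` string and its `/a/1`, `/a/2` links are assumptions, not the real page):

```php
<?php
// Hypothetical markup standing in for the real page, which the question does not show.
$sample = <<<HTML
<ul class="river">
  <li><a href="/a/1"><h2 class="hed">First</h2></a><p class="dek">Summary 1</p></li>
  <li><a href="/a/2"><h2 class="hed">Second</h2></a><p class="dek">Summary 2</p></li>
</ul>
HTML;

// The sample is well-formed, so loadXML() parses it exactly as written;
// a real page would go through loadHTML() instead.
$doc = new DOMDocument();
$doc->loadXML($sample);
$xpath = new DOMXPath($doc);

// "/" is the XPath child step, comparable to the CSS selector "ul.river > h2.hed".
$direct = $xpath->query('//ul[@class="river"]/h2[@class="hed"]');

// "//" is the descendant step, comparable to "ul.river h2.hed".
$nested = $xpath->query('//ul[@class="river"]//h2[@class="hed"]');

echo $direct->length, "\n"; // 0: no h2 is a direct child of the ul
echo $nested->length, "\n"; // 2: both h2 elements are nested somewhere below it
```

If the real page nests its headings the way this sample does, only the descendant form finds them.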

Upvotes: 0

trincot

Reputation: 350310

The echo outputs the contents of the ul, not the contents of the h2:

echo strip_tags($post, '<p><a>');

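Note that the statement before the echo does not narrow $post down to the h2. Setting outertext to an empty string blanks out the first matching h2, so that heading is dropped from what $post prints while everything else in the ul remains: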

$post->find('h2[class=hed]',0)->outertext = "";

Change code to this:

$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');

However, that only handles the first h2 found, so you need a second, nested loop. Here is a rewrite of the code after load_file:

$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
    if ($postNum >= 10) break; // limit reached
    $heds = $post->find('h2[class=hed]');
    foreach($heds as $hed) {
        echo strip_tags($hed, '<p><a>');
    }
}

If you still need to clear outertext, you can do it with $hed:

$hed->outertext = "";
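
For comparison, here is a minimal sketch of the same idea (iterate over the headings themselves, never echo the whole list) written with PHP's built-in DOM extension instead of Simple HTML DOM. The markup in `$sample`, the `$results` array, and the assumption that each h2 sits inside its link are all hypothetical. Unlike the loop above, the limit here counts headings rather than ul elements.

```php
<?php
// Hypothetical markup standing in for the real page, which the question does not show.
$sample = <<<HTML
<ul class="river">
  <li><a href="/a/1"><h2 class="hed">First</h2></a><p class="dek has-dek">Summary 1</p></li>
  <li><a href="/a/2"><h2 class="hed">Second</h2></a><p class="dek has-dek">Summary 2</p></li>
</ul>
HTML;

// The sample is well-formed, so loadXML() is enough here;
// a real page would go through loadHTML() instead.
$doc = new DOMDocument();
$doc->loadXML($sample);
$xpath = new DOMXPath($doc);

$limit = 10;   // maximum number of headings to collect
$results = [];

// Select only the h2 elements below the list, so the p.dek text never enters the output.
foreach ($xpath->query('//ul[@class="river"]//h2[@class="hed"]') as $h2) {
    if (count($results) >= $limit) {
        break;
    }
    // The nearest enclosing <a>, if there is one, supplies the link.
    $link = $xpath->query('ancestor::a[1]', $h2)->item(0);
    $results[] = [
        'title' => trim($h2->textContent),
        'href'  => $link ? $link->getAttribute('href') : null,
    ];
}

// Print one "title href" pair per line.
foreach ($results as $r) {
    echo $r['title'], ' ', $r['href'], "\n";
}
```

Because only the h2 nodes are selected, the summaries in the p elements do not appear, which is the behavior the question asks for.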

Upvotes: 1
