Reputation: 45
I need some help with a study script I'm building that fetches articles from a website.
Currently I can get the article from one element, but I fail to get all of them. This is an example of the markup on the page I'm trying to fetch:
<div class="entry-content">
</div>
<div class="entry-content">
</div>
<div class="entry-content">
</div>
This is my PHP code to get the content of the first div:
function getArticle($url) {
    $content = file_get_contents($url);
    $first_step = explode('<div class="entry-content">', $content);
    $separate_news = explode("</div>", $first_step[1]);
    $article = $separate_news[0];
    echo $article;
}
Upvotes: 0
Views: 2767
Reputation: 350127
You should use DOMDocument. Although it is a bit tricky to select nodes by CSS class, you can do it with DOMXPath like this:
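libxml_use_internal_errors(true); // real-world HTML is rarely valid; suppress parser warnings
$dom = new DOMDocument();
$dom->loadHTMLFile($url); // loadHTMLFile() parses HTML; load() expects well-formed XML
$xpath = new DOMXPath($dom);
$classname = "entry-content";
// Match elements whose class attribute contains $classname as a whole word
$nodes = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " ' . $classname . ' ")]');
foreach ($nodes as $node) {
    echo $node->textContent . "\n";
}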
A further advantage is that HTML entities and other markup inside the article content are converted as expected: &amp; becomes &, and <b>bold</b> becomes just bold.
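To try this without fetching a real page, here is a minimal sketch that runs the same XPath query against an inline HTML string. The sample markup and the helper name extractByClass() are made up for illustration; they are not part of the DOM extension or the original question.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body>'
      . '<div class="entry-content">Tom &amp; Jerry</div>'
      . '<div class="entry-content featured">Some <b>bold</b> text</div>'
      . '<div class="sidebar">Not an article</div>'
      . '</body></html>';

// Illustrative helper (assumed name): returns the text of every element
// whose class attribute contains $classname as a whole word.
function extractByClass(string $html, string $classname): array
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//*[contains(concat(" ", normalize-space(@class), " "), " ' . $classname . ' ")]';

    $articles = [];
    foreach ($xpath->query($query) as $node) {
        // textContent decodes entities and drops the tags themselves
        $articles[] = trim($node->textContent);
    }
    return $articles;
}

$articles = extractByClass($html, 'entry-content');
print_r($articles);
```

The second div is matched even though it carries an extra class, and the sidebar div is skipped, which is what the concat/normalize-space trick is for.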
Upvotes: 1
Reputation: 2531
I have used this library before: http://simplehtmldom.sourceforge.net/ (full documentation: http://simplehtmldom.sourceforge.net/manual.htm). It's very easy to use and does a lot more. You could select your articles like this:
include 'simple_html_dom.php'; // the library file; adjust the path to where you saved it
$html = file_get_html($url);
$articles = $html->find(".entry-content");
foreach ($articles as $article) {
    echo $article->plaintext;
}
Upvotes: 1
Reputation: 147146
You should really use PHP's DOMDocument class for parsing HTML. In terms of your example code, the problem is that you're not processing all the results from your $first_step array. You could try something like this:
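// $content is the page source from file_get_contents($url), as in the question.
$first_steps = explode('<div class="entry-content">', $content);
// Skip the first chunk: it is everything before the first entry-content div.
foreach (array_slice($first_steps, 1) as $first_step) {
    if (strpos($first_step, '</div>') === false) continue;
    $separate_news = explode('</div>', $first_step);
    $article = $separate_news[0];
    echo $article;
}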
Here's a small demo on 3v4l.org
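For reference, here is a self-contained sketch of the same explode-based loop, run against a made-up sample string and collecting the articles into an array instead of echoing them. The sample markup and the function name splitArticles() are assumptions for illustration only.

```php
<?php
// Hypothetical sample markup; in the question, $content comes from file_get_contents($url).
$content = '<div class="header">Site title</div>'
         . '<div class="entry-content">First article</div>'
         . '<div class="entry-content">Second article</div>'
         . '<div class="entry-content">Third article</div>';

// Illustrative helper (assumed name): splits the page on the opening tag
// and keeps the text up to the first closing tag of each chunk.
function splitArticles(string $content): array
{
    $chunks = explode('<div class="entry-content">', $content);
    array_shift($chunks); // drop everything before the first opening tag

    $articles = [];
    foreach ($chunks as $chunk) {
        // Everything up to the first </div>; an article that contains a
        // nested <div> would be cut short here.
        $articles[] = explode('</div>', $chunk)[0];
    }
    return $articles;
}

$articles = splitArticles($content);
foreach ($articles as $article) {
    echo $article, "\n";
}
```

Note the limitation in the comment: string splitting cannot tell a nested closing tag from the outer one, which is one reason a real parser such as DOMDocument is the safer choice.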
Upvotes: 2