Reputation: 27
I would like to parse news titles and links from the following RSS page:
http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE
I have tried using this code (but it's not working):
<?php
$xml=("http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
However, the same code works to get RSS titles and links from other RSS pages, for example:
<?php
$xml=("https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
Do you have any idea on how to make it work?
Thanks in advance!
Upvotes: 0
Views: 61
Reputation: 21522
The problem is that you are trying to download a remote document with DOMDocument::load. The method is capable of downloading remote files, but it doesn't set the User-Agent HTTP header unless one is specified via the user_agent INI setting. Some hosts are configured to reject HTTP requests if the User-Agent header is absent, and the URL you pasted into the question returns 403 Forbidden when the header is missing.
So you should either set the user agent via the INI setting:
ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);
or download the document manually with the User-Agent header set, e.g.:
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadXML($xml);
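Alternatively (a sketch, assuming the http stream wrapper is enabled and using the same placeholder agent string), a stream context can carry the header, avoiding both the global INI change and cURL:

```php
<?php
// Build a stream context that adds a User-Agent header to HTTP requests,
// and tell libxml to use it for all of its stream operations.
$context = stream_context_create([
    'http' => ['header' => "User-Agent: MyCrawler/1.0\r\n"],
]);
libxml_set_streams_context($context);

$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url); // the User-Agent header is now sent with the request
```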
The next problem with your code is that you are fully relying on a specific DOM structure:
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
There are many cases where your code will not work as expected: fewer than 6 items, missing elements, an empty document, etc. Besides, the code is not very readable. You should always check whether a node exists before going deeper into its structure, e.g.:
$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
    // Print channel properties
    foreach ($channel->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        switch ($child->nodeName) {
            case 'title':
                echo "Title: ", $child->nodeValue, PHP_EOL;
                break;
            case 'description':
                echo "Description: ", $child->nodeValue, PHP_EOL;
                break;
        }
    }
}
You can parse the item elements in a similar manner:
$items = $channel->getElementsByTagName('item');
foreach ($items as $item) {
    // ...
}
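For illustration, here is a self-contained sketch of that pattern run against a tiny hand-written feed (the sample XML below is made up, not the real LSE feed):

```php
<?php
// A minimal two-item RSS sample so the parsing logic can be shown in isolation.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First headline</title><link>http://example.com/1</link></item>
    <item><title>Second headline</title><link>http://example.com/2</link></item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($rss);

foreach ($doc->getElementsByTagName('item') as $item) {
    $titles = $item->getElementsByTagName('title');
    $links  = $item->getElementsByTagName('link');
    // Only print an entry when both child elements are actually present.
    if ($titles->length > 0 && $links->length > 0) {
        echo $titles->item(0)->nodeValue, ' => ',
             $links->item(0)->nodeValue, PHP_EOL;
    }
}
```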
Upvotes: 1
Reputation: 149
They block requests that don't set a user agent, so you'll have to use cURL and fake a user agent to get the XML content, e.g.:
$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$xml = curl_exec($ch);
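The snippet stops at curl_exec, so as a sketch, the body can then be handed to DOMDocument with a couple of guards ($ch and $xml continue from the code above):

```php
// $xml is false when the request failed; bail out with the cURL error.
if ($xml === false) {
    exit('Download failed: ' . curl_error($ch));
}
curl_close($ch);

$doc = new DOMDocument();
if ($doc->loadXML($xml)) {
    foreach ($doc->getElementsByTagName('item') as $item) {
        echo $item->getElementsByTagName('title')->item(0)->nodeValue, PHP_EOL;
    }
}
```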
Upvotes: 1