Tim

Reputation: 27

Parsing RSS news not working

I would like to parse news titles and links from the following RSS page:

http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE

I have tried using this code (but it's not working):

<?php

$xml=("http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE");

$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');

for ($i=0; $i<=5; $i++) {
  $title=$x->item($i)->getElementsByTagName('title')
  ->item(0)->childNodes->item(0)->nodeValue;
  $link=$x->item($i)->getElementsByTagName('link')
  ->item(0)->childNodes->item(0)->nodeValue;

  echo $title;
  echo $link;

}
?>

However, the same code works for getting RSS titles and links from other RSS feeds, for example:

<?php

$xml=("https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US");

$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');

for ($i=0; $i<=5; $i++) {
  $title=$x->item($i)->getElementsByTagName('title')
  ->item(0)->childNodes->item(0)->nodeValue;
  $link=$x->item($i)->getElementsByTagName('link')
  ->item(0)->childNodes->item(0)->nodeValue;

  echo $title;
  echo $link;

}
?>

Do you have any idea on how to make it work?

Thanks in advance!

Upvotes: 0

Views: 61

Answers (2)

Ruslan Osmanov

Reputation: 21522

Downloading Remote Documents

The problem is that you are trying to download a remote document with DOMDocument::load. The method can download remote files, but it does not send a User-Agent HTTP header unless one is specified via the user_agent INI setting. Some hosts are configured to reject HTTP requests that lack a User-Agent header, and the URL in the question returns 403 Forbidden when the header is missing.

So you should either set the user agent via the INI setting:

ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);

or download the document manually with the User-Agent header set, e.g.:

$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);

$doc = new DOMDocument();
$doc->loadXML($xml);
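
Note that curl_exec() returns false on a transport error, and a 403 response still comes back with a body, so it may be worth checking both before calling loadXML. The following is only a sketch of one way to do that: fetchFeed and parseFeed are hypothetical helper names (not part of cURL or DOM), and treating anything other than HTTP 200 as a failure is an assumption made for illustration.

```php
<?php
// Hypothetical helper: download a URL with a User-Agent header.
// Returns the response body, or null on a transport error or non-200 status.
function fetchFeed(string $url, string $userAgent = 'MyCrawler/1.0'): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Assumption: only a plain 200 counts as success here.
    if ($body === false || $status !== 200) {
        return null;
    }
    return $body;
}

// Hypothetical helper: parse an XML string without emitting PHP warnings.
// Returns the document, or null if the string is empty or not well-formed.
function parseFeed(string $xml): ?DOMDocument
{
    if ($xml === '') {
        return null; // loadXML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // collect errors instead of printing them
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the previous setting
    return $ok ? $doc : null;
}
```

With these in place, the calling code can bail out early, e.g. `$xml = fetchFeed($url); $doc = $xml === null ? null : parseFeed($xml);`, and only traverse the DOM when $doc is not null.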

Traversing the DOM

The next problem with your code is that it relies entirely on a specific DOM structure:

for ($i=0; $i<=5; $i++) {
  $title=$x->item($i)->getElementsByTagName('title')
    ->item(0)->childNodes->item(0)->nodeValue;

There are many cases where your code will not work as expected: fewer than six items (the loop runs from 0 to 5), missing elements, an empty document, etc. Besides, the code is not very readable. You should always check that a node exists before going deeper into its structure, e.g.:

$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
  // Print channel properties
  foreach ($channel->childNodes as $child) {
    if ($child->nodeType !== XML_ELEMENT_NODE) {
      continue;
    }
    switch ($child->nodeName) {
      case 'title':
        echo "Title: ", $child->nodeValue, PHP_EOL;
        break;
      case 'description':
        echo "Description: ", $child->nodeValue, PHP_EOL;
        break;
    }
  }
}

You can parse the item elements in a similar manner, inside the loop over $channels where $channel is defined:

$items = $channel->getElementsByTagName('item');
foreach ($items as $item) {
  // ...
}
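
To tie this back to the titles and links asked about in the question, the item loop could be filled in roughly as follows. This is a minimal sketch, not a definitive implementation: firstChildText and extractItems are hypothetical helper names, the limit of six mirrors the loop in the question, and skipping items that lack a title or a link is an assumption about the desired behavior.

```php
<?php
// Hypothetical helper: text of the first direct child element with the given
// name, or null if the parent has no such child.
function firstChildText(DOMElement $parent, string $name): ?string
{
    foreach ($parent->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === $name) {
            return trim($child->textContent);
        }
    }
    return null;
}

// Hypothetical helper: collect up to $limit items that have both a title and a link.
function extractItems(DOMDocument $doc, int $limit = 6): array
{
    $result = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        if (count($result) >= $limit) {
            break;
        }
        $title = firstChildText($item, 'title');
        $link = firstChildText($item, 'link');
        if ($title === null || $link === null) {
            continue; // assumption: incomplete items are skipped
        }
        $result[] = ['title' => $title, 'link' => $link];
    }
    return $result;
}

// Usage with a small inline document (made-up sample data).
$sample = new DOMDocument();
$sample->loadXML(
    '<rss><channel>'
    . '<item><title>First</title><link>http://example.com/1</link></item>'
    . '<item><title>No link</title></item>'
    . '</channel></rss>'
);
foreach (extractItems($sample) as $entry) {
    echo $entry['title'], ' ', $entry['link'], PHP_EOL;
}
```

Because the loop iterates over whatever items exist, it behaves the same for an empty feed, a feed with two items, or a feed with fifty.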

Upvotes: 1

davidear

Reputation: 149

They have security in place that rejects requests with no user agent, so you'll have to send one, for example by using cURL with a browser-like user agent to get the XML content:

$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

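$ch = curl_init();
// Peer verification is left at its default (enabled); the URL is plain HTTP anyway.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $url);
$xml = curl_exec($ch);
if ($xml === false) {
    // curl_exec() returns false on failure; curl_error() describes why.
    die('Request failed: ' . curl_error($ch));
}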
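
The snippet above stops once the XML is in $xml. One possible way to get from that string to the titles and links is DOMXPath; the sketch below assumes the usual RSS 2.0 layout (item elements under channel), and titlesAndLinks is a hypothetical function name.

```php
<?php
// Hypothetical helper: return [['title' => ..., 'link' => ...], ...] for each
// channel/item in an RSS string, or an empty array if the XML cannot be parsed.
function titlesAndLinks(string $xml): array
{
    if ($xml === '') {
        return [];
    }
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$ok) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//channel/item') as $item) {
        $rows[] = [
            // string(...) yields '' when the child element is missing
            'title' => $xpath->evaluate('string(title)', $item),
            'link'  => $xpath->evaluate('string(link)', $item),
        ];
    }
    return $rows;
}
```

It would be called as `foreach (titlesAndLinks($xml) as $row) { echo $row['title'], ' ', $row['link'], PHP_EOL; }` after checking that curl_exec() did not return false.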

Upvotes: 1
