Reputation: 27
I would like to parse news titles and links from the following RSS page:
http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE
I have tried using this code (but it's not working):
<?php
$xml=("http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
However, the same code works to get RSS titles and links from other RSS pages, for example:
<?php
$xml=("https://feeds.finance.yahoo.com/rss/2.0/headline?s=bcm.v&region=US&lang=en-US");
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$link=$x->item($i)->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
echo $title;
echo $link;
}
?>
Do you have any idea on how to make it work?
Thanks in advance!
Upvotes: 0
Views: 61
Reputation: 21522
The problem is that you are trying to download a remote document with DOMDocument::load. The method is capable of downloading remote files, but it doesn't set the User-Agent HTTP header unless one is specified via the user_agent INI setting. Some hosts are configured to reject HTTP requests if the User-Agent header is absent, and the URL you pasted into the question returns 403 Forbidden when the header is missing.
So you should either set the user agent via the INI setting:
ini_set('user_agent', 'MyCrawler/1.0');
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url);
or download the document manually with the User-Agent header set, e.g.:
$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
$xml = curl_exec($ch);
$doc = new DOMDocument();
$doc->loadXML($xml);
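Alternatively (a sketch, assuming the http stream wrapper is enabled and using the same placeholder agent string), a stream context can carry the header, avoiding both the global INI change and cURL:

```php
<?php
// Build a stream context that adds a User-Agent header to HTTP requests,
// and tell libxml to use it for all of its stream operations.
$context = stream_context_create([
    'http' => ['header' => "User-Agent: MyCrawler/1.0\r\n"],
]);
libxml_set_streams_context($context);

$url = 'http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE';
$doc = new DOMDocument();
$doc->load($url); // the User-Agent header is now sent with the request
```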
The next problem with your code is that you are fully relying on a specific DOM structure:
for ($i=0; $i<=5; $i++) {
$title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
There are many cases where your code will not work as expected: fewer than 6 items, missing elements, an empty document, etc. Besides, the code is not very readable. You should always check whether a node exists before going deeper into its structure, e.g.:
$channels = $doc->getElementsByTagName('channel');
foreach ($channels as $channel) {
    // Print channel properties
    foreach ($channel->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue;
        }
        switch ($child->nodeName) {
            case 'title':
                echo "Title: ", $child->nodeValue, PHP_EOL;
                break;
            case 'description':
                echo "Description: ", $child->nodeValue, PHP_EOL;
                break;
        }
    }
}
You can parse the item elements in a similar manner:
$items = $channel->getElementsByTagName('item');
foreach ($items as $item) {
    // ...
}
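For illustration, here is a self-contained sketch of that pattern run against a tiny hand-written feed (the sample XML below is made up, not the real LSE feed):

```php
<?php
// A minimal two-item RSS sample so the parsing logic can be shown in isolation.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First headline</title><link>http://example.com/1</link></item>
    <item><title>Second headline</title><link>http://example.com/2</link></item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($rss);

foreach ($doc->getElementsByTagName('item') as $item) {
    $titles = $item->getElementsByTagName('title');
    $links  = $item->getElementsByTagName('link');
    // Only print an entry when both child elements are actually present.
    if ($titles->length > 0 && $links->length > 0) {
        echo $titles->item(0)->nodeValue, ' => ',
             $links->item(0)->nodeValue, PHP_EOL;
    }
}
```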
Upvotes: 1
Reputation: 149
They block requests that don't set a user agent, so you'll have to use cURL and fake a user agent to get the XML content, e.g.:
$url = "http://www.londonstockexchange.com/exchange/CompanyNewsRSS.html?newsSource=RNS&companySymbol=LSE";
$agent= 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL,$url);
$xml = curl_exec($ch);
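The snippet stops at curl_exec, so as a sketch, the body can then be handed to DOMDocument with a couple of guards ($ch and $xml continue from the code above):

```php
// $xml is false when the request failed; bail out with the cURL error.
if ($xml === false) {
    exit('Download failed: ' . curl_error($ch));
}
curl_close($ch);

$doc = new DOMDocument();
if ($doc->loadXML($xml)) {
    foreach ($doc->getElementsByTagName('item') as $item) {
        echo $item->getElementsByTagName('title')->item(0)->nodeValue, PHP_EOL;
    }
}
```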
Upvotes: 1