Axel Stone

Reputation: 1580

How to parse <media:content> tag in RSS with simplexml

The structure of my RSS feed from http://rss.cnn.com/rss/edition.rss is:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://rss.cnn.com/~d/styles/itemcontent.css"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <title><![CDATA[CNN.com - RSS Channel - Intl Homepage - News]]></title>
    <description><![CDATA[CNN.com delivers up-to-the-minute news and information on the latest top stories, weather, entertainment, politics and more.]]></description>
    <link>http://www.cnn.com/intl_index.html</link>
    ...

    <item>
      <title><![CDATA[Russia responds to claims it has damaging material on Trump]]></title>
      <description><![CDATA[The Kremlin denied it has compromising information about US President-elect Donald Trump, describing the allegations as "pulp fiction".]]></description>
      <link>http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</link>
      <guid isPermaLink="true">http://www.cnn.com/2017/01/11/politics/russia-rejects-trump-allegations/index.html</guid>
      <pubDate>Wed, 11 Jan 2017 14:44:49 GMT</pubDate>
      <media:group>
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-large-11.jpg" height="300" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-large-gallery.jpg" height="552" width="414" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-video-synd-2.jpg" height="480" width="640" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-live-video.jpg" height="324" width="576" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-vertical-gallery.jpg" height="360" width="270" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-story-body.jpg" height="169" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-assign.jpg" height="186" width="248" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-hp-video.jpg" height="144" width="256" />
      </media:group>
    </item>
    ...

  </channel>
</rss>

If you parse this XML with simplexml like this:

  $rss = simplexml_load_file($url, null, LIBXML_NOCDATA);

  $rssjson = json_encode($rss);
  $rssarray = json_decode($rssjson, TRUE);

you will see that <media:content> is simply missing from the items in $rssarray. I found a tutorial with a namespace-based solution. However, the author of that example uses:

foreach ($xml->channel->item as $item) { ... }

but I am using this (for various reasons I cannot use foreach):

$rssjson = json_encode($rss);
$rssarray = json_decode($rssjson, TRUE);

So I modified the solution for my case like this:

  $rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
  $namespaces = $rss->getNamespaces(true); // get namespaces

  $rssjson = json_encode($rss);
  $rssarray = json_decode($rssjson, TRUE);

  if (isset($rssarray['channel']['item'])) {
    foreach ($rssarray['channel']['item'] as $key => $item) {

      $media_content = $rss->channel->item[$key]->children($namespaces['media']);
      foreach($media_content as $tag) {

        $tagjson = json_encode($tag);
        $tagarray = json_decode($tagjson, TRUE);

      }

    }
  }

But it does not work. For every item, $tagarray ends up as an array with this structure:

Array(
  'content' => array(
     '0' => array(null),
     '1' => array(null),
     ...
     '11' => array(null),
   )
)

It is an array with as many entries as there are <media:content> tags, but every entry is empty. I need to get the url attribute of each one. What am I doing wrong, and why am I getting an empty array?

Upvotes: 4

Views: 6793

Answers (2)

Subhash P

Reputation: 43

I had a requirement to aggregate RSS news feeds from different sources, which had image tags in different formats, so I used the code below:

//Sample Feed 1: https://www.hindustantimes.com/rss/topnews/rssfeed.xml
//Sample Feed 2: https://economictimes.indiatimes.com/rssfeedsdefault.cms

$feed = $_GET['feed'];

$rss = simplexml_load_file($feed);
$namespaces = $rss->getNamespaces(true);

echo '<strong>'. $rss->channel->title . '</strong><br><br>';

foreach ($rss->channel->item as $item) {

    // Reset per item so a previous item's image URL is not reused
    $imageAlt = '';

    // Only look for media:* children if the feed declares that namespace
    if (isset($namespaces['media'])) {
        foreach ($item->children($namespaces['media']) as $i) {
            // Keeps the url attribute of the last media:* child only
            $imageAlt = (string)$i->attributes()->url;
        }
    }

    echo "Link: " . $item->link ."<br>";
    echo "Title: " . $item->title ."<br>";
    echo "Description: " . $item->description ."<br>";
    echo "PubDate: " . $item->pubDate ."<br>";
    echo "Image: " . $item->image ."<br>";
    echo "ImageAlt: " . $imageAlt ."<br>";
    echo "<br><br>";
}

Upvotes: 0

Álvaro González

Reputation: 146450

The <media:content> tags are actually empty (self-closing):

<media:content ... />
                   ^^

Information is contained in attributes, which can be fetched with SimpleXMLElement::attributes(), e.g.:

$rss = simplexml_load_file($url, null, LIBXML_NOCDATA);
$namespaces = $rss->getNamespaces(true);
$media_content = $rss->channel->item[0]->children($namespaces['media']);
foreach($media_content->group->content as $i){
    var_dump((string)$i->attributes()->url);
}

I suspect the problem comes from the JSON trick. SimpleXML generates all its classes and properties dynamically (they aren't regular PHP classes), which means that you can't fully rely on standard PHP features like print_r() or json_encode(). You can see this by inserting the following into the loop above:

var_dump($i, json_encode($i), (string)$i->attributes()->url);

which prints:

object(SimpleXMLElement)#2 (0) {
}
string(2) "{}"
string(91) "http://i2.cdn.turner.com/cnnnext/dam/assets/161115120658-trump-putin-t1-tease-super-169.jpg"
...
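
If a plain array is still what you need, one option is to keep the JSON trick for the ordinary fields and read the namespaced attributes explicitly. The following is only a minimal sketch under some assumptions: mediaUrls() is a hypothetical helper name, the inline XML and its example.com URLs are made up for illustration, and the namespace URI is the one declared on the feed's <rss> tag.

```php
<?php
// Hypothetical helper: collects the url attribute of every <media:content>
// in one <item>, whether it sits directly under the item or in <media:group>.
function mediaUrls(SimpleXMLElement $item): array
{
    // Namespace URI bound to the "media" prefix in the feed's <rss> tag
    $mediaNs = 'http://search.yahoo.com/mrss/';
    $media = $item->children($mediaNs);
    $urls = [];

    // <media:content> placed directly under <item>
    foreach ($media->content as $content) {
        $urls[] = (string) $content->attributes()->url;
    }

    // <media:content> nested inside <media:group>
    foreach ($media->group as $group) {
        foreach ($group->children($mediaNs)->content as $content) {
            $urls[] = (string) $content->attributes()->url;
        }
    }

    return $urls;
}

// Made-up sample data shaped like the feed in the question
$xml = <<<XML
<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>Example</title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" height="1" width="1" />
        <media:content medium="image" url="http://example.com/b.jpg" height="1" width="1" />
      </media:group>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($xml, SimpleXMLElement::class, LIBXML_NOCDATA);

$items = [];
foreach ($rss->channel->item as $item) {
    // The JSON round trip copies the plain (non-namespaced) child elements
    $row = json_decode(json_encode($item), true);
    // The namespaced attributes are added by hand
    $row['media_urls'] = mediaUrls($item);
    $items[] = $row;
}

print_r($items);
```

Each entry in $items then holds the usual fields plus a media_urls list, so the rest of the code can keep working with arrays while the attribute values are read through SimpleXML itself.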

Upvotes: 4
