Reputation: 9038

Regular expression to remove tags and contents from RSS file

This is the example structure of my RSS file:

<item>
 <title>My Title</title>
 <link>http://www.link.com</link>
 <description>The description</description>
 <author>Blah Blah</author>
 <pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
 <media:content url="myimage.jpg">
  <media:title>sdafsd</media:title>
 </media:content>
 <position>1</position>
</item>

How can I remove the author tag and its contents, the entire media:content tag and its contents, and the position tag and its contents completely from the file using PHP regular expressions?

Thanks!

Upvotes: 1

Answers (4)

FtDRbwLXw6

Reputation: 28889

Disclaimer: For flexibility and reliability, you should use a proper parser such as DOMDocument to manipulate XML/HTML. That said, if you're sure your markup is well-formed, its structure won't change, and it contains no nested tags of the same name, regular expressions can solve problems like this. Only use them if you know what you're doing.

You'll want to use preg_replace() to replace each match with an empty string (""). Here is how it could be done for the <author>...</author> block:

$markup = preg_replace('#<author>(.*?)</author>#is', '', $markup);

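This matches the opening tag <author>, anything (or nothing) between the opening and closing tags, and the closing tag </author>. The s modifier lets . match newlines, so the block may span several lines, and the i modifier makes the match case-insensitive.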

The other tags can be removed in similar fashion.
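
As a rough sketch of how the same idea extends to all three elements, the pattern below uses an alternation and a backreference so that a single preg_replace() call handles author, position, and media:content. The sample markup is an assumption for illustration, and the caveats from the disclaimer still apply: well-formed input, no nested tags of the same name.

```php
<?php
// Sample markup, assumed for illustration (mirrors the structure in the question).
$markup = <<<XML
<item>
 <title>My Title</title>
 <author>Blah Blah</author>
 <media:content url="myimage.jpg">
  <media:title>sdafsd</media:title>
 </media:content>
 <position>1</position>
</item>
XML;

// <(name), optional attributes, then anything (lazily) up to the closing tag
// with the same name (\1). Trailing whitespace is consumed so no blank lines remain.
$pattern = '#<(author|position|media:content)\b[^>]*>.*?</\1>\s*#is';
$markup = preg_replace($pattern, '', $markup);

echo $markup;
```

The backreference \1 ensures that each opening tag is paired with a closing tag of the same name, which keeps the three cases in one expression instead of three separate calls.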

Upvotes: 0

Berry Langerak

Reputation: 18859

My previous answer was - rightfully - removed; I should have added it as a comment. Here's an alternative that uses DOMDocument to do exactly what you want:

<?php

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>bla</title>
    <link>bla</link>
    <description>A description</description>
    <language>en-us</language>
    <item xmlns:media="http://search.yahoo.com/mrss/">
     <title>My Title</title>
     <link>http://www.link.com</link>
     <description>The description</description>
     <author>Blah Blah</author>
     <pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
     <media:content url="myimage.jpg">
      <media:title>sdafsd</media:title>
     </media:content>
     <position>1</position>
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXml( $xml );

foreach( $doc->getElementsByTagName( 'item' ) as $item ) {
    $item->removeChild( $item->getElementsByTagName( 'author' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'position' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'content' )->item( 0 ) );
}

var_dump( $doc->saveXml( ) );
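
The loop above removes only the first match per item and assumes each element is present. A possible variation, sketched here with DOMXPath, selects every matching node with a namespace-aware query and detaches each from its own parent. Declaring the media prefix on the rss element and the trimmed sample feed are assumptions made for this illustration.

```php
<?php
// Trimmed sample feed, assumed for illustration.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>My Title</title>
      <author>Blah Blah</author>
      <media:content url="myimage.jpg">
        <media:title>sdafsd</media:title>
      </media:content>
      <position>1</position>
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Register the prefix so the XPath expression can address media:content.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('media', 'http://search.yahoo.com/mrss/');

// The query result is a static node list, so removing nodes while iterating is safe.
$query = '//item/author | //item/position | //item/media:content';
foreach ($xpath->query($query) as $node) {
    $node->parentNode->removeChild($node);
}

$result = $doc->saveXML();
echo $result;
```

If an element is missing from an item, the query simply returns no node for it, so nothing needs to be checked before removal.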

Upvotes: 1

Andy

Reputation: 3743

$content = file_get_contents($file_name);

// Each pattern matches the opening tag (with optional attributes),
// its contents, and the closing tag; the s modifier lets . span newlines.
$xmlElem = 'author';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);

$xmlElem = 'media:content';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);

$xmlElem = 'position';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);

Upvotes: 0

Madara's Ghost

Reputation: 174967

Don't use regex to parse HTML/XML; there are perfectly good parsers out there:

<?php

$xml = <<<XML
<item>
    <title>My Title</title>
    <link>http://www.link.com</link>
    <description>The description</description>
    <author>Blah Blah</author>
    <pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
    <media:content url="myimage.jpg">
        <media:title>sdafsd</media:title>
    </media:content>
    <position>1</position>
</item>
XML;

$dom = new DOMDocument();
//DOMDocument throws warnings when the XML is invalid, we don't care.
//Though in this case, the media: namespace would be ignored because it's not defined.
@$dom->loadXML($xml);
$document = $dom->documentElement;

//Find the elements you want to remove
$author = $document->getElementsByTagName("author")->item(0);
$content = $document->getElementsByTagName("content")->item(0);

//And remove them.
$document->removeChild($author);
$document->removeChild($content);

//Output the resulting XML.
echo $dom->saveXML();
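
If silencing warnings with @ feels too blunt, one alternative is to let libxml collect its errors internally and inspect them afterwards. The following is only a sketch of that approach; the trimmed fragment and the $messages variable are assumptions for illustration.

```php
<?php
// Trimmed version of the fragment above; the media: prefix is deliberately left undeclared.
$xml = <<<XML
<item>
    <title>My Title</title>
    <media:content url="myimage.jpg">
        <media:title>sdafsd</media:title>
    </media:content>
</item>
XML;

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadXML($xml);

// Gather the messages so the caller can decide whether they matter.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}

// Clear the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```

This keeps the parser quiet during loading while still making problems such as an undeclared namespace prefix visible to the calling code.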

Upvotes: 3
