Reputation: 9038
This is the example structure of my RSS file:
<item>
<title>My Title</title>
<link>http://www.link.com</link>
<description>The description</description>
<author>Blah Blah</author>
<pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
How can I remove the author tag and its contents, the entire media:content tag and its contents, and the position tag and its contents completely from the file using PHP regular expressions?
Thanks!
Upvotes: 1
Views: 446
Reputation: 28889
Disclaimer: For flexibility and reliability, you should always use a proper parser like DOMDocument for manipulating XML/HTML. That being said, if you're sure that your markup is well-formed, not subject to structural change, and will not contain nested duplicate tags, regular expressions can solve problems like this. But you should only use them if you know what you're doing.
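To illustrate the nested-duplicate-tag caveat, here is a minimal sketch. The markup and the <section> tag name are made up for the demonstration and do not come from the feed in the question.

```php
<?php
// Hypothetical markup with a <section> nested inside another <section>.
$markup = '<doc><section>outer <section>inner</section> tail</section></doc>';

// The lazy match stops at the first closing tag, which belongs to the inner element.
$result = preg_replace('#<section>.*?</section>#is', '', $markup);

echo $result; // prints: <doc> tail</section></doc>
```

The leftover closing tag leaves the markup malformed, which is the kind of breakage a real parser avoids.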
You'll want to use preg_replace() to replace each match with an empty string (""). Here is how it could be done for the <author>...</author> block:
$markup = preg_replace('#<author>(.*?)</author>#is', '', $markup);
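Basically, this matches the opening tag <author>, anything (or nothing) between the opening and closing tags, and the closing tag </author>. The i modifier makes the match case-insensitive, and the s modifier lets the dot match newlines, so the element may span several lines.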
The other tags can be removed in similar fashion.
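As a rough sketch of what "similar fashion" could look like for all three tags: the opening <media:content> tag carries a url attribute, so the pattern below allows optional attributes. The sample markup, the $tags array, and the trailing \s* (which swallows the leftover line break) are illustrative choices, not part of the original answer.

```php
<?php
// Hypothetical sample input; in practice this would be the contents of the RSS file.
$markup = <<<XML
<item>
<title>My Title</title>
<author>Blah Blah</author>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
XML;

// Tags to strip, taken from the question.
$tags = ['author', 'media:content', 'position'];

foreach ($tags as $tag) {
    // Escape the tag name so characters such as ":" are treated literally.
    $name = preg_quote($tag, '#');
    // (?:\s[^>]*)? allows optional attributes such as url="..." on the opening tag.
    // The trailing \s* also removes the line break left behind by the removed block.
    $markup = preg_replace('#<' . $name . '(?:\s[^>]*)?>.*?</' . $name . '>\s*#is', '', $markup);
}

echo $markup;
```

The same caveats from the disclaimer apply: this assumes well-formed markup without nested elements of the same name.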
Upvotes: 0
Reputation: 18859
My previous answer was, rightfully, removed; I should have added it as a comment. Here's an alternative using DOMDocument that does exactly what you want:
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>bla</title>
<link>bla</link>
<description>A description</description>
<language>en-us</language>
<item xmlns:media="http://search.yahoo.com/mrss/">
<title>My Title</title>
<link>http://www.link.com</link>
<description>The description</description>
<author>Blah Blah</author>
<pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
</channel>
</rss>
XML;
$doc = new DOMDocument();
$doc->loadXml( $xml );
foreach( $doc->getElementsByTagName( 'item' ) as $item ) {
    $item->removeChild( $item->getElementsByTagName( 'author' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'position' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'content' )->item( 0 ) );
}
var_dump( $doc->saveXml( ) );
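A possible variant of the same idea, sketched with DOMXPath: it selects all three element types in one query and removes each node through its parent, so items that lack one of the elements are simply skipped. The shortened sample feed and the choice to declare the media namespace on the root element are assumptions made for this example.

```php
<?php
// Hypothetical sample feed; the media namespace URI is the one used in the answer above.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<title>bla</title>
<item>
<title>My Title</title>
<author>Blah Blah</author>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
</channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Register the prefix so the XPath expression can address media:content.
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('media', 'http://search.yahoo.com/mrss/');

// Copy the matches into an array before modifying the tree.
$targets = iterator_to_array(
    $xpath->query('//item/author | //item/media:content | //item/position')
);

foreach ($targets as $node) {
    // Removing via the parent works wherever the node sits in the tree.
    $node->parentNode->removeChild($node);
}

$result = $doc->saveXML();
echo $result;
```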
Upvotes: 1
Reputation: 3743
// $file_name is the path to the RSS file.
$content = file_get_contents($file_name);
// Remove each element, allowing for optional attributes on the opening tag.
$xmlElem = 'author';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);
$xmlElem = 'media:content';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);
$xmlElem = 'position';
$content = preg_replace('#<' . $xmlElem . '(?:\s+[^>]+)?>(.*?)</' . $xmlElem . '>#s', '', $content);
Upvotes: 0
Reputation: 174967
Don't use regex to parse HTML/XML; there are perfectly good parsers out there:
<?php
$xml = <<<XML
<item>
<title>My Title</title>
<link>http://www.link.com</link>
<description>The description</description>
<author>Blah Blah</author>
<pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
XML;
$dom = new DOMDocument();
//DOMDocument throws warnings when the XML is invalid, we don't care.
//Though in this case, the media: namespace would be ignored because it's not defined.
@$dom->loadXML($xml);
$document = $dom->documentElement;
//Find the elements you want to remove
$author = $document->getElementsByTagName("author")->item(0);
$content = $document->getElementsByTagName("content")->item(0);
$position = $document->getElementsByTagName("position")->item(0);
//And remove them.
$document->removeChild($author);
$document->removeChild($content);
$document->removeChild($position);
//Output the resulting XML.
echo $dom->saveXML();
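If you would rather inspect parser warnings than silence them with @, one option is libxml's internal error handling. This is only a sketch under the assumption that the fragment is loaded as shown in the question; the $messages array is a name introduced here for illustration, and only the author element is removed to keep the example short.

```php
<?php
// Hypothetical fragment: the media prefix is left undeclared, as in the question.
$xml = <<<XML
<item>
<title>My Title</title>
<author>Blah Blah</author>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
XML;

// Collect libxml problems internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);

// Keep the messages so they can be logged or inspected later.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

if ($loaded) {
    $author = $dom->getElementsByTagName('author')->item(0);
    // Guard against a missing element before removing it.
    if ($author !== null) {
        $author->parentNode->removeChild($author);
    }
    echo $dom->saveXML();
}
```

Restoring the previous setting afterwards keeps the change from affecting other code that relies on the default warning behavior.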
Upvotes: 3