Reputation: 96
I want to be able to load any HTML document and edit it using PHP's DOMDocument functionality.
The problem is that some websites, for example Facebook, add XML-style namespaces to their tags:
<fb:like send="true" width="450" show_faces="true"></fb:like>
DOMDocument is very tolerant of dirty code, but it will not accept namespaces in HTML code. What happens is that the namespace prefix gets deleted, so the tag above comes back as a plain <like> tag.
So my idea was to convert the HTML I get into XML so that I can parse it using loadXML. My question is: how do I do this, and which tool should I use? (I have heard of Tidy, but I can't get it to work.) Or is it better to use a different parser, one that can handle namespaces in HTML code?
Code snippet:
<?php
$html = file_get_contents($_POST['url']);
$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
// Do anything here; it doesn't matter what. As an example, I'm deleting the head tag.
$headTag = $domDoc->getElementsByTagName("head")->item(0);
$headTagParent = $headTag->parentNode;
$headTagParent->removeChild($headTag);
echo $domDoc->saveHTML();
// This works as expected for any URL EXCEPT the ones that use XML namespaces, as Facebook does (see above). For such dirty code, DOMDocument deletes the namespace.
?>
Upvotes: 4
Views: 542
Reputation: 1423
Building on Syndace's answer, here is some regex-based code that will escape out your namespaces by replacing each colon with "___" (you can choose some other escape sequence that you think is safer):
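$doc = new DOMDocument();
// Escape the namespace colon in opening and closing tags, e.g. <fb:like> becomes <fb___like>
$modifiedHtml = preg_replace('/<(\/?)([a-z]+)\:/', '<$1$2___', $inputHtml);
$x = $doc->loadHTML($modifiedHtml);
// ...if desired, do stuff to your parsed html here...
// Restore the colon in the serialized output
$outputHtml = preg_replace('/<(\/?)([a-z]+)___/', '<$1$2:', $doc->saveHTML());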
This should work on <fb:like>, <mynamespace:mytag> or anything else you throw at it.
Upvotes: 1
Reputation: 96
There is no clean way to parse HTML with namespaces using DOMDocument without losing the namespaces, but there are some workarounds:
If you want to stick with DOMDocument, you basically have to pre- and postprocess the code.
Before you send the code to DOMDocument->loadHTML, use regex, loops or whatever you want to find all namespaced tags, and add to each opening tag a custom attribute containing the namespace.
<fb:like send="true" width="450" show_faces="true"></fb:like>
would then result in
<fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like>
Now give the edited code to DOMDocument->loadHTML. It will strip out the namespaces, but it will keep the attributes, resulting in
<like xmlNamespace="fb" send="true" width="450" show_faces="true"></like>
Now (again using regex, loops or whatever you want) find all tags with the attribute xmlNamespace, remove the attribute, and put its value back in front of the tag name as the namespace. Don't forget to also add the namespace to the closing tags!
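The pre- and postprocessing steps above could be sketched roughly as follows. This is only a minimal illustration, not a robust HTML rewriter: the function names markNamespacedTags and restoreNamespacedTags and the marker attribute data-ns-prefix are made up for this example. A lowercase marker is used instead of xmlNamespace on the assumption that the HTML parser may lowercase attribute names, and the restore step also tolerates a parser that keeps the prefix on the tag name. Closing tags are matched to their opening tags with a simple stack, which assumes reasonably well-formed output such as what saveHTML produces, and it will be confused by tag-like text inside comments or scripts.

```php
<?php
// Hypothetical marker attribute (an assumption of this sketch, kept lowercase on purpose).
const NS_MARKER = 'data-ns-prefix';

// Step 1: copy the prefix of every namespaced opening tag into the marker attribute.
function markNamespacedTags(string $html): string
{
    return preg_replace(
        '/<([a-z][a-z0-9]*):([a-z][a-z0-9_-]*)/i',
        '<$1:$2 ' . NS_MARKER . '="$1"',
        $html
    );
}

// Step 3: turn the marker attribute back into a prefix on opening AND closing tags.
function restoreNamespacedTags(string $html): string
{
    // Elements that are serialized without a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'];
    $stack = []; // one entry per open element: its prefix, or null if it has none
    $marker = preg_quote(NS_MARKER, '/');

    return preg_replace_callback(
        '/<(\/?)([a-z][a-z0-9:_-]*)([^>]*)>/i',
        function (array $m) use (&$stack, $void, $marker) {
            [$tag, $slash, $name, $rest] = $m;
            // Local name without any prefix the parser may have kept.
            $local = preg_replace('/^.*:/', '', $name);

            if ($slash === '/') {
                // Closing tag: reuse the prefix recorded for its opening tag.
                $prefix = array_pop($stack);
                return $prefix === null ? $tag : "</{$prefix}:{$local}>";
            }

            $prefix = null;
            if (preg_match('/\s' . $marker . '="([^"]*)"/i', $rest, $attr)) {
                // Marked opening tag: drop the marker and put the prefix back.
                $prefix = $attr[1];
                $rest = str_replace($attr[0], '', $rest);
                $tag = "<{$prefix}:{$local}{$rest}>";
            }

            // Only elements that will get a closing tag go on the stack.
            $selfClosing = substr(rtrim($rest), -1) === '/';
            if (!$selfClosing && !in_array(strtolower($local), $void, true)) {
                $stack[] = $prefix;
            }
            return $tag;
        },
        $html
    );
}
```

In use, the string returned by markNamespacedTags($html) would be passed to loadHTML, the DOM edited as needed, and the result of saveHTML() passed through restoreNamespacedTags.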
Upvotes: 4