Reputation: 27
I'm using the XHTML Transitional doctype for displaying content in a browser. But before the content is displayed, it is passed through an XML parser (DOMDocument) for final touches before being output to the browser.
I use a custom designed CMS for my website, that allows me to make changes to the site. I have a module that allows me to display HTML scripts on my website in a way similar to WordPress widgets.
The problem I am facing right now is that any code provided through this module must be valid XHTML, or else the module needs to convert it to valid XHTML. Currently, if a portion of the input code is not XHTML compliant, my XML parser breaks and throws warnings.
What I am looking for is a solution that encodes the entities present in the URLs and text portions of the input provided via a TextArea control. For example, the following string will break the parser with an entity reference error:
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
The following line causes the same error:
<a href="http://www.somesite.com">Books & Cool stuff</a>
P.S. If I use htmlentities or htmlspecialchars, they also convert the angle brackets of tags, which I don't want. I just need the URLs and text portions of the string to be escaped/encoded.
Any help would be greatly appreciated.
Thanks and regards, Waqar Mushtaq
Upvotes: 0
Views: 1238
Reputation: 198227
As already suggested in a quick comment, you can solve the problem quite comfortably with the PHP tidy extension.
To convert an HTML fragment - even a good tag soup - into something DomDocument or SimpleXML can deal with, you can use something like the following:
$config = array(
'output-xhtml' => 1,
'show-body-only' => 1
);
$fragment = tidy_repair_string($html, $config);
$xhtml = sprintf("<body>%s</body>", $fragment);
Example: Format tag soup HTML as valid XHTML with tidy_repair_string
Tidy has many options; the two used here are needed for fragments and XHTML compatibility.
The only problem left now is that this XHTML fragment can contain entities that DomDocument or SimpleXML do not understand, for example &nbsp;. This and others are undefined in XML.
As far as DomDocument is concerned (you wrote that you use it), it also supports loading HTML instead of XML, which deals with those entities:
$dom = new DomDocument;
$dom->loadHTML($xhtml);
Example: Loading HTML with DomDocument
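To illustrate the loadHTML route without tidy, here is a minimal sketch that parses one of the failing strings from the question with the lenient HTML parser and serialises it back with saveXML. The function name html_fragment_to_xml is made up for this example, and it assumes ASCII input and that the dom extension is enabled; treat it as a starting point rather than a complete solution.

```php
<?php
// Hypothetical helper (not from the thread): parse a non-XHTML fragment with
// DOMDocument's HTML parser, then serialise the body's children as XML.
function html_fragment_to_xml(string $fragment): string
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML('<html><body>' . $fragment . '</body></html>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialise each child of <body> so the wrapper elements are not included.
    $out = '';
    $body = $dom->getElementsByTagName('body')->item(0);
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveXML($child);
    }
    return $out;
}

echo html_fragment_to_xml('<a href="http://www.somesite.com">Books & Cool stuff</a>'), "\n";
```

The bare ampersand should come back as an escaped entity in the serialised output. Note that saveXML may write empty elements such as script in self-closing form, which is valid XML but worth checking against what browsers expect.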
Upvotes: 0
Reputation: 57721
What you'd need to do is generate valid XHTML in the first place. All your attributes must be htmlentitied.
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>
should be
<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&amp;sumthing"></script>
and
<a href="http://www.somesite.com">Books & Cool stuff</a>
should be
<a href="http://www.somesite.com">Books &amp; Cool stuff</a>
It's not easy to always generate valid XHTML. If at all possible I would recommend you find some other way of doing the post processing.
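If the input can't be fixed at the source, one rough way to escape only the stray ampersands, without touching the angle brackets the way htmlspecialchars does, is a regular expression with a negative lookahead. This is a sketch, not something from the thread: escape_bare_ampersands is a hypothetical name, and it only handles ampersands. It does not repair unclosed tags or a stray < in text, and named entities such as &nbsp; are left as they are even though XML does not define them.

```php
<?php
// Hypothetical helper: replace every "&" that does not already start a named
// entity (&name;), a decimal reference (&#123;) or a hex reference (&#x1F;).
function escape_bare_ampersands(string $markup): string
{
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';
    return preg_replace($pattern, '&amp;', $markup);
}

// Both failing strings from the question:
echo escape_bare_ampersands('<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>'), "\n";
echo escape_bare_ampersands('<a href="http://www.somesite.com">Books & Cool stuff</a>'), "\n";
```

Ampersands that are already part of an entity (for example &amp;) are not double-escaped, so the function can be run on input that is partly correct already.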
Upvotes: 1
Reputation: 5229
HTML Tidy is a computer program and a library whose purpose is to fix invalid HTML and to improve the layout and indent style of the resulting markup.
Examples of bad HTML it is able to fix:
Upvotes: 0