Waqar Mushtaq

Reputation: 27

Ensure valid XHTML from a string in PHP

I'm using the XHTML Transitional doctype for displaying content in a browser. Before the content is displayed, however, it is passed through an XML parser (DOMDocument) for final touches and only then output to the browser.

I use a custom-designed CMS for my website that allows me to make changes to the site. It has a module that lets me display HTML scripts on my website in a way similar to WordPress widgets.

The problem I am facing right now is that any code provided through this module must be valid XHTML; otherwise the module needs to convert it to valid XHTML. Currently, if any portion of the input is not XHTML compliant, my XML parser breaks and throws warnings.

What I am looking for is a solution that encodes the entities present in the URLs and text portions of the input provided via a textarea control. For example, the following string breaks the parser with an entity reference error:

<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>

The following line causes the same error:

<a href="http://www.somesite.com">Books & Cool stuff</a>

P.S. If I use htmlentities or htmlspecialchars, they also convert the angle brackets of the tags, which I don't want. I only need the URLs and text portions of the string to be escaped/encoded.

Any help would be greatly appreciated.

Thanks and regards, Waqar Mushtaq

Upvotes: 0

Views: 1238

Answers (3)

hakre

Reputation: 198227

As already suggested in a quick comment, you can solve the problem quite comfortably with the PHP tidy extension.

To convert an HTML fragment, even real tag soup, into something DOMDocument or SimpleXML can deal with, you can use something like the following:

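// $html holds the raw fragment submitted through the module.
$config = array(
    'output-xhtml'   => 1, // emit XHTML instead of HTML
    'show-body-only' => 1  // return only the body contents, without <html>/<head>
);
$fragment = tidy_repair_string($html, $config);
// Wrap the repaired fragment in a single root element.
$xhtml = sprintf("<body>%s</body>", $fragment);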

Example: Format tag soup HTML as valid XHTML with tidy_repair_string.

Tidy has many options; the two used here are needed for fragment output and XHTML compatibility.

The only problem left now is that this XHTML fragment can contain entities that DOMDocument or SimpleXML do not understand, for example &nbsp;. This and other HTML named entities are undefined in XML.

As far as DOMDocument is concerned (you wrote that you use it), it can also load HTML instead of XML, and its HTML parser handles those entities:

$dom = new DomDocument;
$dom->loadHTML($xhtml);

Example: Loading HTML with DomDocument
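
To make the difference between the two loaders concrete, here is a minimal sketch that feeds the same fragment to loadXML() and loadHTML(). The sample string and the variable names are made up for illustration, and libxml errors are collected rather than printed as warnings.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Illustrative fragment: &nbsp; is an HTML entity, &amp; is valid in both HTML and XML.
$xhtml = '<body><p>Books&nbsp;&amp; Cool stuff</p></body>';

// The XML parser only knows the predefined XML entities, so &nbsp; is rejected.
$xml = new DOMDocument();
$xmlOk = $xml->loadXML($xhtml);
$xmlErrors = libxml_get_errors();
libxml_clear_errors();

// The HTML parser knows the HTML entity set and accepts the same input.
$html = new DOMDocument();
$htmlOk = $html->loadHTML($xhtml);
libxml_clear_errors();

$text = $html->getElementsByTagName('p')->item(0)->textContent;

var_dump($xmlOk, $htmlOk);
if ($xmlErrors) {
    echo trim($xmlErrors[0]->message), "\n";
}
```

Note that loadHTML() builds a full HTML document around the fragment, so the nodes you want to post-process end up under the generated body element.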

Upvotes: 0

Halcyon

Reputation: 57721

What you need to do is generate valid XHTML in the first place. All your attributes must be entity-encoded (as htmlentities would do), and so must the text content.

<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>

should be

<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&amp;sumthing"></script>

and

<a href="http://www.somesite.com">Books & Cool stuff</a>

should be

<a href="http://www.somesite.com">Books &amp; Cool stuff</a>

It's not easy to always generate valid XHTML. If at all possible I would recommend you find some other way of doing the post processing.
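
For the two examples above, one possible sketch is to escape only those ampersands that do not already start an entity or character reference, leaving the angle brackets alone. The function name escape_bare_ampersands is hypothetical, not part of PHP, and the approach is an assumption of this sketch rather than a complete XHTML fixer: it does not repair broken tags, and it will also escape ampersands inside inline script bodies.

```php
<?php
/**
 * Escape ampersands that are not already part of an entity or
 * character reference. Hypothetical helper for illustration only.
 */
function escape_bare_ampersands(string $markup): string
{
    // Negative lookahead: leave &name;, &#123; and &#x1F; untouched.
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '&amp;', $markup) ?? $markup;
}

$script = '<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>';
$link   = '<a href="http://www.somesite.com">Books & Cool stuff</a>';

echo escape_bare_ampersands($script), "\n";
echo escape_bare_ampersands($link), "\n";
```

Already-encoded input such as &amp; or &#160; passes through unchanged, so the function can be applied more than once without double-encoding.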

Upvotes: 1

Jacek Kaniuk

Reputation: 5229

HTML Tidy is a computer program and a library whose purpose is to fix invalid HTML and to improve the layout and indent style of the resulting markup.

http://tidy.sourceforge.net/

Examples of bad HTML it is able to fix:

  • Missing or mismatched end tags, mixed up tags
  • Adding missing items (some tags, quotes, ...)
  • Reporting proprietary HTML extensions
  • Change layout of markup to predefined style
  • Transform characters from some encodings into HTML entities
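
As a rough illustration of the first item in the list, the sketch below runs a fragment with missing end tags through PHP's tidy extension. The sample markup is invented, the two options are the ones from the other answer, and the code assumes the tidy extension may not be installed, so it checks for it first.

```php
<?php
// Illustrative tag soup: neither <b> nor either <p> is closed.
$soup = '<p>Unclosed <b>bold<p>Second paragraph';

$output = null;
if (extension_loaded('tidy')) {
    $tidy = new tidy();
    // Parse as UTF-8, asking for XHTML output of the body contents only.
    $tidy->parseString($soup, array('output-xhtml' => true, 'show-body-only' => true), 'utf8');
    $tidy->cleanRepair();
    $output = tidy_get_output($tidy);
    echo $output, "\n";
} else {
    echo "The tidy extension is not available.\n";
}
```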

Upvotes: 0
