user3297287

Reputation: 11

PHP and DOM: parsing error on XML with unescaped special characters inside

I have this XML:

<title>My title</title>
<text>This is a text and I love it <3 </text>

When I try to parse it with DOM, I get an error because of the "<3": Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...

Do you know how I can escape all the special characters inside the text while keeping my XML tree? The goal is to use this method: $document->loadXML($xmlContent);

Thanks a lot for your answers.

EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it...

Upvotes: 1

Views: 626

Answers (3)

luis_pmb

Reputation: 146

The character "<" marks the start of a tag in XML, so it cannot appear literally in text content. It has to be replaced with its predefined entity:

&lt;

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

So the input text should be:

<title>My title</title>
<text>This is a text and I love it &lt;3 </text>

An XML built like that should be rejected, and whoever sends it should replace the reserved characters with their predefined entities. Doing said task with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward.
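As a minimal sketch of what the sender could do, assuming they build the XML by string concatenation and that the text value is available in a variable (the name $rawText is hypothetical), htmlspecialchars() with the ENT_XML1 flag escapes the reserved characters before the value goes into the element:

```php
<?php
// Sender side (assumed): escape the text value, not the whole document,
// so that the surrounding tags stay intact.
$rawText = 'This is a text and I love it <3 ';

// ENT_XML1 selects XML 1 escaping rules; ENT_QUOTES also escapes both quote types.
$escaped = htmlspecialchars($rawText, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$xmlContent = '<text>' . $escaped . '</text>';
echo $xmlContent, PHP_EOL;
```

Note that this only works on the individual values. Running htmlspecialchars() over the complete document would escape the tags as well.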

Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.

This one, in particular, will replace a single "<" with "&lt;" inside a "text" element whose content otherwise consists only of letters, digits or spaces:

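// The "<" must be mandatory here: with an optional "[<]?" the pattern would also
// match clean elements and insert a spurious "&lt;" into them.
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);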

It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
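If the input is not limited to one known element, a slightly more general sketch is to escape every "<" that cannot possibly start a tag, a comment, a CDATA section or a processing instruction. The helper name escapeStrayLessThan() is hypothetical, and the `<root>` wrapper is an assumption added here because loadXML() needs a single root element. This is still a heuristic: text such as "a <b" would be mistaken for a tag, and stray "&" characters are not handled.

```php
<?php
// Hypothetical helper: turn "<" into "&lt;" whenever the next character
// cannot begin markup (a name start character, "/", "!" or "?").
function escapeStrayLessThan(string $xml): string
{
    return preg_replace('/<(?![A-Za-z_:\/!?])/', '&lt;', $xml);
}

// Assumed input: the XML from the question, wrapped in a single root element.
$xmlContent = '<root><title>My title</title>'
    . '<text>This is a text and I love it <3 </text></root>';

$document = new DOMDocument();
$loaded = $document->loadXML(escapeStrayLessThan($xmlContent));

// The parser decodes "&lt;" again, so the text node holds the original "<3".
$textValue = $document->getElementsByTagName('text')->item(0)->textContent;
echo $textValue, PHP_EOL;
```

The lookahead leaves all real tags untouched, so the tree is preserved and only the offending character is rewritten.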

Upvotes: 2

Juan de Parras

Reputation: 778

You need to put the content with special characters inside a CDATA section:

<text><![CDATA[This is a text and I love it <3 ]]></text>
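As a quick check, assuming the XML can be produced in this form by whoever generates it, DOMDocument loads the CDATA version without warnings and returns the text unchanged:

```php
<?php
// The "<3" is inside a CDATA section, so the parser treats it as plain text.
$xmlContent = '<text><![CDATA[This is a text and I love it <3 ]]></text>';

$document = new DOMDocument();
$loaded = $document->loadXML($xmlContent);

// textContent returns the character data without the CDATA delimiters.
$textValue = $document->documentElement->textContent;
echo $textValue, PHP_EOL;
```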

Upvotes: 0

Realitätsverlust

Reputation: 3953

XML says that every element name has to start with a letter (or an underscore), nothing else is allowed. The parser reads "<3" as the start of a tag named "3", so it is not possible.

A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:

  1. Manually add a letter in front of the tag name with an if check
  2. Rework your XML so nothing like that can ever happen.

Upvotes: 0
