Reputation: 11
I have this XML:
<title>My title</title>
<text>This is a text and I love it <3 </text>
When I try to parse it with DOM, I get an error because of the "<3": Warning: DOMDocument::loadXML(): StartTag: invalid element name in Entity...
Do you know how I can escape all the special characters inside the content while keeping my XML tree? The goal is to use this method: $document->loadXML($xmlContent);
Thanks a lot for your answers.
EDIT: I forgot to say that I cannot modify the XML. I receive it like that and I have to deal with it...
Upvotes: 1
Views: 626
Reputation: 146
The symbol "<" is a reserved markup character in XML and thus cannot be used literally in a text node. It should be replaced with its predefined entity:
&lt;
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
So the input text should be:
<title>My title</title>
<text>This is a text and I love it &lt;3 </text>
An XML document built like that should be rejected, and whoever sends it should replace the reserved characters with the predefined entities. Doing that with tools like htmlentities() and htmlspecialchars(), as Y U NO WORK suggests, is easy and straightforward.
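To illustrate what the sender could do, here is a minimal sketch. The helper name xmlText() is made up for this example, and it assumes PHP 5.4 or later so that the ENT_XML1 flag is available:

```php
<?php
// Hypothetical sender-side helper (not part of the original question):
// escape a piece of text before embedding it in an XML element.
function xmlText(string $text): string
{
    // ENT_XML1 limits the output to XML's predefined entities;
    // ENT_QUOTES also escapes single and double quotes.
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$xml = '<text>' . xmlText('This is a text and I love it <3 ') . '</text>';
echo $xml, PHP_EOL; // prints: <text>This is a text and I love it &lt;3 </text>
```

Only the text content goes through the helper, never the whole document, otherwise the tags themselves would be escaped too.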
Now, if you really need to parse said data, you need to sanitize it prior to parsing. This is not a recommended behaviour, particularly if you are receiving arbitrary text, but if it is a set of known or predictable characters, regular expressions can do the job.
This one, in particular, will escape a single "<" contained in a "text" element made up of letters, digits or white space:
$xmlContent = preg_replace('/(<text>[a-zA-Z 0-9]*)<([a-zA-Z 0-9]*<\/text>)/', '$1&lt;$2', $xmlContent);
It is very specific, but it is done on purpose: regular expressions are really bad at matching nested structures, such as HTML or XML. Applying more arbitrary regular expressions to HTML or XML can have wildly unexpected behaviours.
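If the stray characters are not limited to one known element, a slightly broader sketch is to escape every "<" that cannot possibly start a tag. This is only an assumption about what the broken input looks like: the helper name escapeStrayLessThan() and the wrapping <doc> root element are invented for the example, and it will still miss cases such as "a <b" where the "<" is followed by a letter.

```php
<?php
// Hypothetical helper: escape "<" characters that are not followed by
// something that can start a tag, closing tag, comment, CDATA section
// or processing instruction.
function escapeStrayLessThan(string $xml): string
{
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace('/<(?![A-Za-z_:\/!?])/', '&lt;', $xml) ?? $xml;
}

// <doc> is an assumed root element, added so the sample is a full document.
$xmlContent = '<doc><title>My title</title>'
    . '<text>This is a text and I love it <3 </text></doc>';
$cleaned = escapeStrayLessThan($xmlContent);

// The DOM extension is optional in some PHP builds, hence the guard.
if (class_exists('DOMDocument')) {
    $document = new DOMDocument();
    $document->loadXML($cleaned);
    echo $document->getElementsByTagName('text')->item(0)->textContent, PHP_EOL;
}
```

Well-formed input passes through unchanged, so running it on every incoming document should be harmless, but it remains a workaround and not a substitute for getting valid XML from the sender.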
Upvotes: 2
Reputation: 778
You need to put the content with special characters inside a CDATA section:
<text><![CDATA[This is a text and I love it <3 ]]></text>
Upvotes: 0
Reputation: 3953
XML says that every tag name has to start with a letter or an underscore (a digit is not allowed), so a tag starting with <3 is not possible.
A workaround for this could be htmlentities() or htmlspecialchars(). But even that won't add a valid character to the beginning, so you should think about either:
if
Upvotes: 0