Reputation: 1610
I have this xml:
<?xml version="1.0" encoding="UTF-8" ?>
<rss xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.2/" version="2.0">
<channel>
<wp:wxr_version>1.2</wp:wxr_version>
<item>
<title type="html">
<![CDATA[ <h1 class="title">“Title with special character”</h1> ]]>
</title>
<content:encoded type="html">
<![CDATA[ <div class="content clearfix">
<p>Content Example Text</p>
</div> ]]>
</content:encoded>
<wp:post_id>0</wp:post_id>
<wp:post_date>2000-09-30T10:22:00.001Z</wp:post_date>
</item>
</channel>
</rss>
Inside the HTML title tag there is the Unicode character U+0007.
Why is the XML invalid?
I'm using CDATA; isn't that supposed to make it valid?
How can I detect which characters are invalid and remove them before constructing the XML?
Upvotes: 0
Views: 935
Reputation: 111630
Let's be clear that we're talking about whether the XML is well-formed, not whether it is valid.
U+0007 is a control character (BEL), used in the past to cause a terminal to beep. It's not allowed in XML, even within CDATA. If it's in the data, then the data is not XML. Your options are to remove it or to encode it so that it's not directly in the data (and so that recipients will understand how to decode it); one encoding option would be Base64 for the contents of any element that has to be able to represent such illegal characters.
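As a rough illustration of both options, here is a minimal Python sketch. It assumes the text is available as a Python string before the XML is built; the helper names (`find_illegal_xml_chars`, `strip_illegal_xml_chars`, `to_base64`) and the sample title are made up for this example. The character class follows the `Char` production of XML 1.0.

```python
import base64
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_illegal_xml_chars(text):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in _ILLEGAL_XML10.finditer(text)]

def strip_illegal_xml_chars(text):
    """Option 1: drop the forbidden characters entirely."""
    return _ILLEGAL_XML10.sub("", text)

def to_base64(text):
    """Option 2: encode the whole value so no raw control char reaches the XML."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

# Hypothetical title with a BEL character hidden in the middle.
title = "\u201cTitle with\u0007 special character\u201d"

print(find_illegal_xml_chars(title))   # where the offending characters are
print(strip_illegal_xml_chars(title))  # safe to place in an element or CDATA
print(to_base64(title))                # recipient must know to Base64-decode
```

Stripping is the simpler choice when the control character is just noise. Base64 only makes sense if the consumer of the feed is under your control and knows to decode that element.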
Michael Kay helpfully commented that XML 1.1 allows additional characters, including U+0007 (written as the character reference &#x7;), beyond those allowed in XML 1.0.
For example, consider the following document:
<?xml version="1.0" encoding="UTF-8" ?>
<r>
<e1></e1> <!-- e1 contains a literal U+0007 char -->
<e2>&#x7;</e2> <!-- &#x7; becomes a U+0007 char -->
<e3><![CDATA[]]></e3> <!-- e3 CDATA contains a literal U+0007 char -->
<e4><![CDATA[&#x7;]]></e4> <!-- &#x7; remains an uninterpreted string -->
</r>
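To try the four cases against a real parser, something like the following Python sketch can be used. It relies on the standard library's `xml.etree.ElementTree`, which is backed by Expat; as far as I know Expat only implements XML 1.0, so this only exercises the 1.0 behaviour. The `cases` dictionary and the `is_well_formed` helper are names invented for this example.

```python
import xml.etree.ElementTree as ET

BEL = "\u0007"

# One fragment per element of the example document.
cases = {
    "e1": f"<e1>{BEL}</e1>",                # literal U+0007
    "e2": "<e2>&#x7;</e2>",                 # character reference to U+0007
    "e3": f"<e3><![CDATA[{BEL}]]></e3>",    # literal U+0007 inside CDATA
    "e4": "<e4><![CDATA[&#x7;]]></e4>",     # the plain string "&#x7;" inside CDATA
}

def is_well_formed(fragment):
    """Wrap the fragment in an XML 1.0 document and try to parse it."""
    doc = '<?xml version="1.0" encoding="UTF-8"?>\n<r>' + fragment + "</r>"
    try:
        ET.fromstring(doc.encode("utf-8"))
        return True
    except ET.ParseError:
        return False

results = {name: is_well_formed(fragment) for name, fragment in cases.items()}
for name, ok in results.items():
    print(name, "well-formed" if ok else "not well-formed")
```

Checking the XML 1.1 behaviour would need a parser that supports XML 1.1, which this sketch does not cover.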
With an XML 1.0 version setting in the XML declaration, the U+0007 characters within e1, e2, and e3 prevent the XML from being well-formed.
With an XML 1.1 version setting in the XML declaration, only the U+0007 characters within e1 and e3 prevent the XML from being well-formed.
Upvotes: 2