Reputation: 1610

XML not well-formed because of a special character inside CDATA

I have this xml:

<?xml version="1.0" encoding="UTF-8" ?>
            <rss xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wp="http://wordpress.org/export/1.2/" version="2.0">
                <channel>
                    <wp:wxr_version>1.2</wp:wxr_version>
            <item>
                        <title type="html">
                        <![CDATA[ <h1 class="title">“Title with special character”</h1> ]]>
                        </title>
                        <content:encoded type="html">
                        <![CDATA[ <div class="content clearfix">
            <p>Content Example Text</p>
        </div> ]]>
                        </content:encoded>
                        <wp:post_id>0</wp:post_id>
                        <wp:post_date>2000-09-30T10:22:00.001Z</wp:post_date>           
                    </item>
                </channel>
            </rss>

Inside the HTML title tag there is the Unicode character U+0007.

Why is the XML invalid?

I'm using CDATA; isn't that supposed to make it valid?

How can I check which characters are invalid and remove them before constructing the XML?

Upvotes: 0

Answers (1)

kjhughes

Reputation: 111785

Let's be clear that the issue here is whether the XML is well-formed, not whether it is valid.

U+0007 is a control character (BEL), used in the past to cause a terminal to beep. It's not allowed in XML, even within CDATA. If it's in the data, then the data is not XML. Your options are to remove it or encode it so that it's not directly in the data (and so that recipients will understand how to decode it); one encoding option would be Base64 for the contents of any element that has to be able to represent such illegal characters.
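As a rough sketch of the "remove it or encode it" options, the Python below checks text against the XML 1.0 `Char` production before it is placed in the document. The function names (`find_invalid_xml_chars`, `strip_invalid_xml_chars`, `encode_for_xml`) are made up for illustration, and whether to strip or Base64-encode depends on what the recipients expect.

```python
import base64
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_invalid_xml_chars(text):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in _INVALID_XML10.finditer(text)]

def strip_invalid_xml_chars(text):
    """Option 1: drop the forbidden characters entirely."""
    return _INVALID_XML10.sub("", text)

def encode_for_xml(text):
    """Option 2: Base64-encode so the raw characters never appear in the XML.
    Recipients must know to decode this element's content."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

title = "Title with special\x07 character"
print(find_invalid_xml_chars(title))
print(strip_invalid_xml_chars(title))
```

Running the check on each string before wrapping it in CDATA reports where the offending characters are, which also answers the "which symbols are invalid" part of the question.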

XML 1.0 vs 1.1

Michael Kay helpfully commented that XML 1.1 allows additional characters, including U+0007 (when written as a character reference such as &#x07;), beyond those allowed in XML 1.0.

For example, consider the following document¹:

<?xml version="1.0" encoding="UTF-8" ?>
<r>
  <e1></e1>  <!-- e1 contains a literal U+0007 char -->
  <e2>&#x07;</e2>  <!-- &#x07; becomes a U+0007 char -->
  <e3><![CDATA[]]></e3>  <!-- e3 CDATA contains a literal U+0007 char -->
  <e4><![CDATA[&#x07;]]></e4>  <!-- &#x07; remains an uninterpreted string -->
</r>

With an XML 1.0 version setting in the XML declaration:

U+0007 characters within e1, e2, and e3 prevent the XML from being well-formed.

With an XML 1.1 version setting in the XML declaration:

U+0007 characters within only e1 and e3 prevent the XML from being well-formed.
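The XML 1.0 case can be checked with a small Python sketch. It assumes the standard-library `xml.etree.ElementTree` parser and tests each element in its own document with no XML declaration, which means XML 1.0. The `is_well_formed` helper is hypothetical, and the XML 1.1 case is not covered here.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

BEL = "\x07"  # literal U+0007

cases = {
    "e1": "<r><e1>" + BEL + "</e1></r>",                  # literal U+0007 in text
    "e2": "<r><e2>&#x07;</e2></r>",                       # character reference
    "e3": "<r><e3><![CDATA[" + BEL + "]]></e3></r>",      # literal U+0007 in CDATA
    "e4": "<r><e4><![CDATA[&#x07;]]></e4></r>",           # uninterpreted string
}

results = {name: is_well_formed(doc) for name, doc in cases.items()}
for name, ok in results.items():
    print(name, "well-formed" if ok else "not well-formed")
```

Only the e4 document should be accepted, since inside CDATA the text `&#x07;` is just six ordinary characters rather than a reference to U+0007.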

¹ Note that the question source (viewable via the edit link on the question) does indeed contain literal U+0007 characters where noted even though the formatted XML does not.

Upvotes: 2
