peter.petrov

Reputation: 39457

Java StAX - error when parsing - Illegal character entity: expansion character code 0x19

I am reading/parsing an XML file with javax.xml.stream.XMLStreamReader.
The file contains this piece of XML data as shown below.

<Row>
  <AccountName value="Paving 101" />
  <AccountNumber value="20205" />
  <AccountId value="15012" />
  <TimePeriod value="2019-08-20" />
  <CampaignName value="CMP Paving 101" />
  <CampaignId value="34283" />
  <AdGroupName value="residential paving" />
  <AdGroupId value="1001035" />
  <AdId value="790008" />
  <AdType value="Expanded text ad" />
  <DestinationUrl value="" />
  <BidMatchType value="Broad" />
  <Impressions value="1" />
  <Clicks value="1" />
  <Ctr value="100.00%" />
  <AverageCpc value="1.05" />
  <Spend value="1.05" />
  <AveragePosition value="2.00" />
  <SearchQuery value="concrete&#x19;driveway&#x19;repair&#x19;methods" />
</Row>

Unfortunately I am getting this error and I am not sure how to resolve it.

    Error in downloadXML: 
    com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x19
     at [row,col {unknown-source}]: [674,40]
        at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:606)
        at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:479)
        at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2448)
        at com.ctc.wstx.sr.StreamScanner.validateChar(StreamScanner.java:2395)
        at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1218)
        at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:1929)
        at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3063)
        at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2961)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2837)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1072)

The problem seems to be the character reference &#x19;.
Of course I can first read the file simply as a text file, and replace this bad character, and only then parse it with XMLStreamReader but:
1) that approach seems really clumsy to me;
2) it will be a bit difficult to do as the code is quite involved there,
so I am not sure if I want to change it just for this character.

Why is the XMLStreamReader unable to handle this character?
Is the XML invalid, or does the parser have a bug that prevents it from handling this character?

Upvotes: 1

Views: 2344

Answers (2)

Vanja D.

Reputation: 854

The problem is that the XML being parsed is malformed: it contains the character reference &#x19;, which is outside the legal character range of XML 1.0.
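To see that this is a well-formedness error rather than a quirk of one parser, the rejection can be reproduced in isolation. The sketch below is only an illustration: the class name `IllegalCharRefDemo` and the helper `isRejected` are made up for this example, and it uses whichever StAX implementation `XMLInputFactory.newInstance()` finds (the JDK's built-in one, or Woodstox if it is on the classpath).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original code base.
class IllegalCharRefDemo {

    /** Returns true if the StAX parser refuses to read the document to the end. */
    static boolean isRejected(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // parsing errors surface here
            }
            reader.close();
            return false;
        } catch (XMLStreamException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 0x19 is a control character outside the XML 1.0 Char range
        System.out.println(isRejected(
                "<SearchQuery value=\"concrete&#x19;driveway\" />"));
        // 0x20 (space) is inside the range
        System.out.println(isRejected(
                "<SearchQuery value=\"concrete&#x20;driveway\" />"));
    }
}
```

A conforming XML 1.0 parser should reject the first document and accept the second, which is why switching StAX implementations does not help.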

The following function removes such character references from malformed XML strings.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static String removeInvalidXmlCharacterReferences(
            String xmlString
    ) {
        // regex to match character references:
        // "&#(?:x([0-9a-fA-F]+)|([0-9]+));"
        Pattern pattern = Pattern.compile(
                "&#" + // all character references start with &#
                "(?:" + // non-capture group, containing either...
                "x([0-9a-fA-F]+)|" + // (1) hex character reference OR
                "([0-9]+)" + // (2) decimal character reference
                ");" // end group, followed by ";"
        );
        // contains invalid references found in the content
        Set<String> invalidReferences = new HashSet<>();
        Matcher matcher = pattern.matcher(xmlString);
        while (matcher.find()) {
            String reference = matcher.group(0); // "&#2;" or "&#xB;"
            String hexMatch = matcher.group(1);  // "B"
            String intMatch = matcher.group(2);  // "2"
            int character;
            try {
                character = hexMatch != null ?
                        Integer.parseInt(hexMatch, 16) :
                        Integer.parseInt(intMatch);
            } catch (NumberFormatException e) {
                // value does not fit in an int, so it cannot be a legal character
                character = -1;
            }
            if (
                    character != 0x9 &&
                    character != 0xA &&
                    character != 0xD &&
                    (character < 0x20 || character > 0xD7FF) &&
                    (character < 0xE000 || character > 0xFFFD) &&
                    (character < 0x10000 || character > 0x10FFFF)
            ) {
                // character is outside every valid range
                // add "&#2;" or "&#xB;" to invalid references
                invalidReferences.add(reference);
            }
        }
        if (invalidReferences.isEmpty()) {
            // no invalid references found, do not sanitize
            return xmlString;
        }
        // create a regex like: "&#2;|&#xB;"
        String invalidRefsRegex = String.join("|", invalidReferences);
        // remove "&#2;" or "&#xB;" from the XML
        return xmlString.replaceAll(invalidRefsRegex, "");
    }

It should be noted that illegal characters should be removed by the producer of the XML, but sometimes you don't have that option.
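When fixing the producer is not an option, the sanitizing step can sit directly in front of the parser, so the existing StAX code barely changes. The following is a minimal sketch under some assumptions: the class `SanitizeThenParse` and its methods are hypothetical names, the document is small enough to hold in a `String`, and the regex is applied blindly, so it would also strip a matching sequence that appears as literal text inside a CDATA section or comment. It does the removal in a single pass with `Matcher.appendReplacement` instead of building a second regex.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class for illustration only.
class SanitizeThenParse {

    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    /** Drops character references to illegal characters, keeps all others. */
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int c = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                valid = isValidXmlChar(c);
            } catch (NumberFormatException e) {
                valid = false; // too large for an int, so not a legal character
            }
            m.appendReplacement(out,
                    valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Sanitizes the input, then reads the SearchQuery value attribute with StAX. */
    static String readSearchQuery(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(
                            new StringReader(stripInvalidCharRefs(xml)));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "SearchQuery".equals(reader.getLocalName())) {
                        return reader.getAttributeValue(null, "value");
                    }
                }
                return null; // element not present
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML is still not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Row><SearchQuery value=\"concrete&#x19;driveway"
                + "&#x19;repair&#x19;methods\" /></Row>";
        System.out.println(readSearchQuery(xml)); // concretedrivewayrepairmethods
    }
}
```

Note that removing the references also removes whatever separator the producer meant them to be. If the words should stay apart, substituting a space for the empty replacement string is a one-line change.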

A version of the function is available as a more verbose XmlDeserUtils gist, which can be easily re-used.

This function was originally authored by Nicholas DiPiazza in this SO answer.

References:

W3C XML 1.0 Character sets (see character range):

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

W3C XML 1.0 Character and Entity References (see character references):

[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
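
Production [2] translates almost literally into a range check. This is a small sketch for illustration (the class name `XmlCharRange` is an assumption, not something from the specification or from any library):

```java
// Hypothetical helper mirroring production [2] Char of XML 1.0.
class XmlCharRange {

    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar(0x19));   // false: control character
        System.out.println(isValidXmlChar(0x20));   // true: space
        System.out.println(isValidXmlChar(0xE000)); // true: private use area
        System.out.println(isValidXmlChar(0xFFFE)); // false: excluded noncharacter
    }
}
```

A character reference such as &#x19; is syntactically a CharRef per production [66], but it is still a well-formedness error because the character it refers to does not match Char.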

Upvotes: 1

Indent

Reputation: 4967

The characters &, < and > (as well as " or ' inside attribute values) cannot appear unescaped in XML.

They're escaped using XML entities; in this case you want &amp; for &.

Your XML will be rejected by every conforming library; you may need to fix the producer of this XML content.

**Edit** from https://www.w3.org/TR/xml/#NT-Char

Allowed range for a character reference (the referenced character must match Char):

Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';'
CharRef   ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
Char      ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Upvotes: 1
