Reputation: 39457
I am reading/parsing an XML file with javax.xml.stream.XMLStreamReader.
The file contains this piece of XML data as shown below.
<Row>
<AccountName value="Paving 101" />
<AccountNumber value="20205" />
<AccountId value="15012" />
<TimePeriod value="2019-08-20" />
<CampaignName value="CMP Paving 101" />
<CampaignId value="34283" />
<AdGroupName value="residential paving" />
<AdGroupId value="1001035" />
<AdId value="790008" />
<AdType value="Expanded text ad" />
<DestinationUrl value="" />
<BidMatchType value="Broad" />
<Impressions value="1" />
<Clicks value="1" />
<Ctr value="100.00%" />
<AverageCpc value="1.05" />
<Spend value="1.05" />
<AveragePosition value="2.00" />
<SearchQuery value="concretedrivewayrepairmethods" />
</Row>
Unfortunately I am getting this error and I am not sure how to resolve it.
Error in downloadXML:
com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x19
at [row,col {unknown-source}]: [674,40]
at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:606)
at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:479)
at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2448)
at com.ctc.wstx.sr.StreamScanner.validateChar(StreamScanner.java:2395)
at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1218)
at com.ctc.wstx.sr.BasicStreamReader.parseAttrValue(BasicStreamReader.java:1929)
at com.ctc.wstx.sr.BasicStreamReader.handleNsAttrs(BasicStreamReader.java:3063)
at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:2961)
at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2837)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1072)
The problem seems to be with the character reference for code 0x19 reported in the error above.
Of course I can first read the file as plain text, replace this bad character, and only then parse it with XMLStreamReader, but:
1) that approach seems really clumsy to me;
2) it will be a bit difficult to do as the code is quite involved there,
so I am not sure if I want to change it just for this character.
Why is the XMLStreamReader unable to handle this character?
Is the XML invalid, or does the parser have a bug and fail to handle it?
Upvotes: 1
Views: 2344
Reputation: 854
The problem is that the XML being parsed is malformed: it contains a character reference to code 0x19 (see the error message), which is outside the legal character range of XML 1.0.
This code snippet removes such characters from malformed XML strings.
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String removeInvalidXmlCharacterReferences(
        String xmlString
) {
    // regex to match character references:
    // "&#(?:x([0-9a-fA-F]+)|([0-9]+));"
    Pattern pattern = Pattern.compile(
        "&#" +                   // all character references start with &#
        "(?:" +                  // non-capture group, containing either...
        "x([0-9a-fA-F]+)|" +     // (1) hex character reference OR
        "([0-9]+)" +             // (2) decimal character reference
        ");"                     // end group, followed by ";"
    );
    // contains invalid references found in the content
    Set<String> invalidReferences = new HashSet<>();
    Matcher matcher = pattern.matcher(xmlString);
    while (matcher.find()) {
        String reference = matcher.group(0); // "&#2;" or "&#xB;"
        String hexMatch = matcher.group(1);  // "B"
        String intMatch = matcher.group(2);  // "2"
        int character = hexMatch != null ?
            Integer.parseInt(hexMatch, 16) :
            Integer.parseInt(intMatch);
        if (
            character != 0x9 &&
            character != 0xA &&
            character != 0xD &&
            (character < 0x20 || character > 0xD7FF) &&
            (character < 0xE000 || character > 0xFFFD) && // this range is legal too
            (character < 0x10000 || character > 0x10FFFF)
        ) {
            // character is out of valid range
            // add "&#xB;" to invalid references
            invalidReferences.add(reference);
        }
    }
    if (invalidReferences.isEmpty()) {
        // no invalid references found, do not sanitize
        return xmlString;
    }
    // create a regex like: "&#2;|&#xB;"
    String invalidRefsRegex = String.join("|", invalidReferences);
    // remove "&#2;" or "&#xB;" from the XML
    return xmlString.replaceAll(invalidRefsRegex, "");
}
It should be noted that illegal characters should be removed by the producer of the XML, but sometimes you don't have that option.
A version of the function is available as a more verbose XmlDeserUtils gist, which can be easily re-used.
This function was originally authored by Nicholas DiPiazza in this SO answer.
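To connect this back to the question, here is a minimal sketch of how the sanitizing step might sit in front of XMLStreamReader. It is not the asker's code: the class name SanitizedStaxExample, the helper firstValue, and the sample input are all hypothetical, and the stripping is done in a single pass with Matcher.appendReplacement rather than with the two-step approach above. It assumes the document fits in memory as a String.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
final class SanitizedStaxExample {
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    private SanitizedStaxExample() {}

    /** Drops character references whose code point is not a legal XML 1.0 Char. */
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean legal;
            try {
                int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
                legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            } catch (NumberFormatException e) {
                legal = false; // too many digits to fit an int, so not a valid code point
            }
            // Keep legal references verbatim; replace illegal ones with nothing.
            m.appendReplacement(out, legal ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Returns the "value" attribute of the first element named elementName, or null. */
    static String firstValue(String xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(stripInvalidCharRefs(xml)));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(elementName)) {
                        return reader.getAttributeValue(null, "value");
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML is still not well-formed", e);
        }
    }
}
```

For example, calling SanitizedStaxExample.firstValue on a made-up row such as <Row><SearchQuery value="concrete&#x19;driveway" /></Row> with the element name "SearchQuery" should parse without the WstxParsingException, because the reference is removed before the parser sees it. For very large files a streaming filter around the Reader would avoid loading everything into a String, but that is beyond this sketch.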
References:
W3C XML 1.0 Character sets (see character range):
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
W3C XML 1.0 Character and Entity References (see character references):
[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
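The Char production quoted above translates almost literally into a range check. The following is a small sketch (the class and method names are made up for illustration) that shows why code 0x19 from the error message is rejected while, for instance, 0x9 and 0xE000 are accepted.

```java
// Hypothetical helper, mirroring production [2] Char of XML 1.0.
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point matches production [2] Char of XML 1.0. */
    static boolean isLegalXml10Char(int codePoint) {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

With this check, XmlChars.isLegalXml10Char(0x19) is false, which matches the parser's complaint.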
Upvotes: 1
Reputation: 4967
The characters &, < and > (as well as " or ' in attributes) are special characters in XML.
They're escaped using XML entities; in this case you want &amp; for &.
Your XML will be rejected by every conforming library; you may need to fix the producer of this XML content.
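On the producer side, the escaping described above could look roughly like the sketch below. The class XmlEscape is hypothetical and only covers the five predefined entities; a real producer would normally rely on its XML writer to do this.

```java
// Hypothetical helper: escapes the five characters that have predefined XML entities.
final class XmlEscape {
    private XmlEscape() {}

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that escaping alone does not make a character such as code 0x19 acceptable: in XML 1.0 it is outside the allowed character range whether it is written literally or as a character reference, so the producer has to drop or replace it.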
Edit: from https://www.w3.org/TR/xml/#NT-Char
A character reference must refer to a character in the allowed range (Char):
Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';'
CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Upvotes: 1