eleven

Reputation: 6847

XmlSlurper converts HTML-encoded symbols

import groovy.xml.XmlUtil

def xmlNode = new XmlSlurper().parseText('<?xml version="1.0" encoding="utf-8"?><b>&#8240;</b>')
println XmlUtil.serialize(xmlNode)

This prints:

<?xml version="1.0" encoding="UTF-8"?>
<b>
  ‰
</b>

Is there a way to prevent &#8240; from being converted into ‰? The XmlSlurper documentation says nothing about it.

Upvotes: 1

Views: 622

Answers (1)

Will

Reputation: 14529

I wrote a proof of concept that overrides XmlSlurper.characters to handle the character entity. StringEscapeUtils from Apache Commons Lang is also needed to convert the character back to its entity code:

@Grab(group='commons-lang', module='commons-lang', version='2.6')

import org.apache.commons.lang.StringEscapeUtils as SE
import groovy.xml.XmlUtil

def parser = new XmlSlurper() {
    void characters(char[] buffer, int start, int length)  {
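        // Escape only the first character of this chunk back to its entity code
        def entity = SE.escapeXml(buffer[start].toString())
        // The new array starts at index 0, so the original offset is not reused
        super.characters entity.toCharArray(), 0, entity.size()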
    }
}

def xml = parser.parseText '<?xml version="1.0" encoding="utf-8"?><b>&#8240;</b>'

def serialized = SE.unescapeXml( XmlUtil.serialize(xml) )
assert '<?xml version="1.0" encoding="UTF-8"?><b>&#8240;</b>\n' == serialized

Note that this handles only a single character; you may need to tweak it if you need multi-character handling. Also note the trailing line break in the assert: it is added by XmlUtil.serialize.
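For the multi-character case, one possible tweak is sketched below. It is only a sketch under a few assumptions: EntityPreservingSlurper and toNumericRefs are hypothetical names, not part of Groovy; the helper replaces Commons Lang with a hand-written loop that re-escapes every non-ASCII character in the chunk; and XmlSlurper is assumed to resolve without an import, as in the code above (newer Groovy versions may need import groovy.xml.XmlSlurper). Instead of unescaping the whole document, it undoes the serializer's &amp; only in front of numeric references, so other escapes such as &lt; stay intact.

```groovy
import groovy.xml.XmlUtil

// Hypothetical subclass (not part of Groovy): re-escapes every non-ASCII
// character as a numeric character reference before XmlSlurper stores the text.
class EntityPreservingSlurper extends XmlSlurper {

    // Hypothetical helper: '‰' becomes '&#8240;', ASCII is left untouched.
    static String toNumericRefs(String text) {
        def out = new StringBuilder()
        int i = 0
        while (i < text.length()) {
            int cp = text.codePointAt(i)
            if (cp > 0x7F) {
                out.append('&#').append(cp).append(';')
            } else {
                out.append((char) cp)
            }
            i += Character.charCount(cp)
        }
        out.toString()
    }

    @Override
    void characters(char[] buffer, int start, int length) {
        // Process the whole chunk, not just its first character
        String escaped = toNumericRefs(new String(buffer, start, length))
        // The new array starts at 0, so the original offset must not be reused
        super.characters(escaped.toCharArray(), 0, escaped.length())
    }
}

def xml = new EntityPreservingSlurper().parseText('<b>a &#8240; b &#233;</b>')

// The serializer escapes '&' as '&amp;', so undo that for numeric references only
def serialized = XmlUtil.serialize(xml).replaceAll(/&amp;(#\d+;)/, '&$1')
println serialized
```

Two caveats: text that literally contained &amp;#8240; in the source would also be rewritten by the final replaceAll, and a surrogate pair split across two characters calls is not handled.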

No idea if it's the best way to do that, though.

Upvotes: 2
