Reputation: 1806
I am using javax.xml.transform.Transformer to take HTML content and parse it into an XML document (I am using the Crouton/TagSoup combination to do this). I don't think that part matters much, though; here is my problem:
I am dumping the output of the Transformer.transform() process and seeing that in the output, entities like &amp;copy; are getting converted to their literal character, in this case the copyright symbol.
Ultimately, this content will be re-saved as an HTML file, but instead of &amp;copy; showing up in the file, it contains the raw special character, which, given HTML standards, should not be used.
Is there any way to get the transformer to ignore already encoded HTML characters from being converted into their actual symbols?
Upvotes: 1
Views: 4239
Reputation: 1806
This is not a proper solution to my original question, but it is a workaround that gets me by.
Since HTML entities are being converted, before I send in the content string I use a regular expression to "convert" the entities into another format, so the parser/transformer does not pick up on them.
Then, in the outgoing string, I simply use another regular expression to convert them back into HTML entities.
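A minimal sketch of that placeholder approach (the `[[ENT:...]]` marker format here is my own invention for illustration, not the poster's actual code; note it would also shield `&amp;amp;` itself, which may or may not be what you want):

```java
import java.util.regex.Pattern;

public class EntityShield {
    // Matches named (&copy;) and numeric (&#169;) entities.
    private static final Pattern ENTITY = Pattern.compile("&(#?\\w+);");

    // Hide entities behind a placeholder the parser won't touch.
    static String protect(String html) {
        return ENTITY.matcher(html).replaceAll("[[ENT:$1]]");
    }

    // Restore the original entities in the serialized output.
    static String restore(String html) {
        return html.replaceAll("\\[\\[ENT:(#?\\w+)\\]\\]", "&$1;");
    }

    public static void main(String[] args) {
        String in = "&copy; 2024 &#169;";
        String shielded = EntityShield.protect(in);
        System.out.println(shielded);                  // [[ENT:copy]] 2024 [[ENT:#169]]
        System.out.println(EntityShield.restore(shielded)); // &copy; 2024 &#169;
    }
}
```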
Upvotes: 0
Reputation: 3260
You could try the following: call transformer.setOutputProperty(OutputKeys.ENCODING, "ASCII"). That way, all non-ASCII characters have to be written as character references.
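As a sketch, here is that idea applied to a plain identity transform (the class name and sample input are mine; the JDK's built-in serializer typically emits decimal references like &amp;#169; here, though the exact form is serializer-dependent):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AsciiOutputDemo {
    // Identity transform forced to ASCII output, so any non-ASCII
    // character must be serialized as a character reference.
    static String toAscii(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "ASCII");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The copyright sign (U+00A9) comes back as a character reference.
        System.out.println(toAscii("<p>\u00a9 2024</p>"));
    }
}
```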
Upvotes: 3
Reputation: 163645
If it's XSLT 2.0, you could use character maps; I believe someone has created character maps defining all the HTML character entities.
Since it's Java, though, there's nothing stopping you from using Saxon, and Saxon has a serialization attribute saxon:character-representation="entity" which seems to do what you want (it doesn't understand all the HTML-defined entities, however, only those in Latin-1).
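For illustration, a minimal XSLT 2.0 character map might look like this. The single &amp;#169;-to-&amp;copy; entry is just an example; a full map would need one xsl:output-character per HTML entity:

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Map the copyright character back to its named entity on output. -->
  <xsl:character-map name="html-entities">
    <xsl:output-character character="&#169;" string="&amp;copy;"/>
  </xsl:character-map>
  <xsl:output method="html" use-character-maps="html-entities"/>

  <!-- Identity template: copy the input through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```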
Upvotes: 2