Reputation: 1806
I am using javax.xml.transform.Transformer to take HTML content and parse it into an XML document (I am using the Crouton/TagSoup combination to do this). I don't think that part matters much, though; here is my problem:
I am dumping the output of the Transformer.transform() process and seeing that in the output, entities like &amp;copy; are getting converted to their literal character, in this case the copyright symbol.
Ultimately, this content will be re-saved as an HTML file, but instead of &amp;copy; showing up in the file, it contains the raw special character, which, given HTML standards, should not be used.
Is there any way to get the transformer to ignore already encoded HTML characters from being converted into their actual symbols?
Upvotes: 1
Views: 4239
Reputation: 1806
This is not a proper solution to my original question, but it is a workaround that gets me by.
Since HTML entities are being converted, before I send in the content string I use a regular expression to "convert" the entities into another format, so the parser/transformer does not pick up on them.
Then, in the outgoing string, I simply use another regular expression to convert them back into HTML entities.
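A minimal sketch of that placeholder approach (the `[[ENT:...]]` marker format here is my own invention for illustration, not the poster's actual code; note it would also shield `&amp;amp;` itself, which may or may not be what you want):

```java
import java.util.regex.Pattern;

public class EntityShield {
    // Matches named (&copy;) and numeric (&#169;) entities.
    private static final Pattern ENTITY = Pattern.compile("&(#?\\w+);");

    // Hide entities behind a placeholder the parser won't touch.
    static String protect(String html) {
        return ENTITY.matcher(html).replaceAll("[[ENT:$1]]");
    }

    // Restore the original entities in the serialized output.
    static String restore(String html) {
        return html.replaceAll("\\[\\[ENT:(#?\\w+)\\]\\]", "&$1;");
    }

    public static void main(String[] args) {
        String in = "&copy; 2024 &#169;";
        String shielded = EntityShield.protect(in);
        System.out.println(shielded);                  // [[ENT:copy]] 2024 [[ENT:#169]]
        System.out.println(EntityShield.restore(shielded)); // &copy; 2024 &#169;
    }
}
```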
Upvotes: 0
Reputation: 3260
You could try the following: call transformer.setOutputProperty(OutputKeys.ENCODING, "ASCII"). That way, all non-ASCII characters have to be written as character references.
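As a sketch, here is that idea applied to a plain identity transform (the class name and sample input are mine; the JDK's built-in serializer typically emits decimal references like &amp;#169; here, though the exact form is serializer-dependent):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AsciiOutputDemo {
    // Identity transform forced to ASCII output, so any non-ASCII
    // character must be serialized as a character reference.
    static String toAscii(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "ASCII");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The copyright sign (U+00A9) comes back as a character reference.
        System.out.println(toAscii("<p>\u00a9 2024</p>"));
    }
}
```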
Upvotes: 3
Reputation: 163645
If it's XSLT 2.0, you could use character maps; I believe someone has created character maps defining all the HTML character entities.
Since it's Java, though, there's nothing stopping you from using Saxon, and Saxon has a serialization attribute saxon:character-representation="entity" which seems to do what you want (it doesn't understand all the HTML-defined entities, however, only those in Latin-1).
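For illustration, a minimal XSLT 2.0 character map might look like this. The single &amp;#169;-to-&amp;copy; entry is just an example; a full map would need one xsl:output-character per HTML entity:

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Map the copyright character back to its named entity on output. -->
  <xsl:character-map name="html-entities">
    <xsl:output-character character="&#169;" string="&amp;copy;"/>
  </xsl:character-map>
  <xsl:output method="html" use-character-maps="html-entities"/>

  <!-- Identity template: copy the input through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```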
Upvotes: 2