David Taylor

Reputation: 59

Using XSLT to remove/replace invalid characters

I am looking for a way to test a given element in the source XML and remove characters that are not valid. Basically, I have a list of allowed characters and need a way to replace any not in that list. Can this be done in XSLT?

To clarify: I am using XSLT to process a valid and complete source XML file so that it can be sent to a consuming system.

The consuming system defines what characters are allowed in certain elements and will reject the XML payload if it contains characters that are not valid. For example: they have provided the following "rule" for valid characters for a specific field:

([0-9a-zA-Z/\-\?:\(\)\.,'\+ \r\n]+)

So what I am looking to do is remove any character that does not match the rule above, i.e. replace it with the empty string. Right now the main cause of rejection is underscores in the field. I know I can use replace() to remove that one character, but I was hoping to define a single rule that strips every character not allowed by the rule above.

Upvotes: 0

Views: 999

Answers (2)

ocrdu

Reputation: 2183

You could use translate() or replace() (the latter needs XSLT 2.0 or later). But if the characters are invalid in the sense that the XML is no longer well-formed, then you can't use XSLT at all, because an XSLT processor requires well-formed XML as input.

Using translate(), removing all characters except those specified in a list goes like this:

translate($string, translate($string,'0123456789',''),'')

The above will remove everything not in the set 0123456789.
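For the character set in the question, the same double-translate() trick might look like the XSLT 1.0 sketch below. The element name Field is a placeholder I made up, not something from the question, and the allowed list is my reading of the question's rule written out character by character (&#13; and &#10; stand for \r and \n).

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Allowed characters, taken from the rule in the question.
       &quot; delimits the XPath string so the apostrophe can sit inside it;
       &#13; and &#10; are CR and LF. -->
  <xsl:variable name="allowed"
    select="&quot;0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/-?:().,'+ &#13;&#10;&quot;"/>

  <!-- Identity template: copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- "Field" is a hypothetical element name; adjust to the real one.
       The inner translate() yields only the characters NOT in $allowed,
       the outer translate() then deletes exactly those from the text. -->
  <xsl:template match="Field/text()">
    <xsl:value-of select="translate(., translate(., $allowed, ''), '')"/>
  </xsl:template>

</xsl:stylesheet>
```

With this, a text value such as ACME_Corp #42 inside Field should come out as ACMECorp 42, since neither the underscore nor the # is in the allowed list.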

The other answer shows a way of doing it using replace() and a regular expression.

If you have control over whatever generates the XML, I would look there for a solution.

Upvotes: 2

HarriKoo

Reputation: 26

You can use replace() as hinted above. Using your regular expression for valid characters, you could try this:

replace($string,"[^0-9a-zA-Z/\-\?:\(\)\.,'\+ \r\n]+","")

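Your regular expression is almost as it was: the outer parentheses are dropped, and ^ is added right after the opening bracket to turn the set of valid characters into its complement, so the pattern matches every run of invalid characters and replaces it with the empty string.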
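In a stylesheet, one way to apply this is sketched below, assuming an XSLT 2.0 processor. Field is again just a placeholder element name, not one from the question. Because the expression contains both double quotes and an apostrophe, the select attribute here is delimited with single quotes and the apostrophe is written as &apos;.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- "Field" is a hypothetical element name; adjust to the real one.
       After XML parsing, the XPath string is the same negated character
       class as above, with &apos; resolved to a plain apostrophe. -->
  <xsl:template match="Field/text()">
    <xsl:value-of
      select='replace(., "[^0-9a-zA-Z/\-\?:\(\)\.,&apos;\+ \r\n]+", "")'/>
  </xsl:template>

</xsl:stylesheet>
```

If several elements have different rules, one template per element with its own pattern keeps each rule in one place.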

Upvotes: 1
