Kishan Ashra

Reputation: 146

How to escape special characters present in xml string in MarkLogic?

I have an XML string coming from Java in base64-encoded format.

PHJvb3Q+PGNoaWxkPiY8L2NoaWxkPjxjaGlsZD48PC9jaGlsZD48Y2hpbGQ+PjwvY2hpbGQ+PGNoaWxkPns8L2NoaWxkPjxjaGlsZD59PC9jaGlsZD4vcm9vdD4=

I decode it using xdmp:base64-decode(), which gives me this output:

<root><child>&</child><child><</child><child>></child><child>{</child><child>}</child>/root>

The output is a string. To convert it to XML, I use xdmp:unquote(), but the special characters in the string produce an error.

I also tried using the repair-full option with xdmp:unquote(), but it didn't resolve the issue.

Note: My actual data contains special characters like these, and they cause unwanted errors.

How can I handle this kind of scenario so that I can insert the XML into MarkLogic?

Upvotes: 1

Views: 1716

Answers (1)

Mads Hansen

Reputation: 66714

The text from that base64 encoded string is not well-formed XML. In addition to the & and < not being encoded properly, the closing tag for the root element is missing <. At the end of the string, </child>/root> should be </child></root>.
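To see the failure without aborting the whole request, the parse can be wrapped in a try/catch. This is only a minimal sketch, assuming the MarkLogic 1.0-ml dialect (where try/catch and the error namespace prefix are available); it reports whatever error code and message the parser raises rather than assuming a particular one:

```xquery
xquery version "1.0-ml";

(: Decode the payload from the question and attempt to parse it unchanged. :)
let $encoded := "PHJvb3Q+PGNoaWxkPiY8L2NoaWxkPjxjaGlsZD48PC9jaGlsZD48Y2hpbGQ+PjwvY2hpbGQ+PGNoaWxkPns8L2NoaWxkPjxjaGlsZD59PC9jaGlsZD4vcm9vdD4="
let $decoded := xdmp:base64-decode($encoded)
return
  try {
    xdmp:unquote($decoded)
  } catch ($e) {
    (: $e is an error:error element; report its code and message instead of failing :)
    fn:concat("Not well-formed: ", $e/error:code, " - ", $e/error:message)
  }
```

A check like this can be useful for routing bad payloads to a repair step or an error queue instead of letting the insert fail.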

As an example of how the text could be scrubbed and repaired, the code below fixes up this specific decoded value and then uses xdmp:unquote() to parse it as XML:

xdmp:unquote(
 replace(
  replace(
   replace(
     xdmp:base64-decode("PHJvb3Q+PGNoaWxkPiY8L2NoaWxkPjxjaGlsZD48PC9jaGlsZD48Y2hpbGQ+PjwvY2hpbGQ+PGNoaWxkPns8L2NoaWxkPjxjaGlsZD59PC9jaGlsZD4vcm9vdD4=")
   ,"&amp;", "&amp;amp;")
  ,"&gt;&lt;&lt;", "&gt;&amp;lt;&lt;")
 ,"/root>", "&lt;/root>")
)

It produces the following well-formed XML:

<root>
  <child>&amp;</child>
  <child>&lt;</child>
  <child>&gt;</child>
  <child>{</child>
  <child>}</child>
</root>
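One way to confirm that the repaired document carries the intended characters is to read the child values back as strings. The sketch below wraps the same three replacements in a helper; the name local:repair is hypothetical and only for illustration, and the replacements remain specific to this one payload:

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: applies the same three replacements to a decoded string. :)
declare function local:repair($text as xs:string) as xs:string {
  (: every bare ampersand becomes an entity reference :)
  let $amp := fn:replace($text, "&amp;", "&amp;amp;")
  (: the stray less-than sign between two tags becomes an entity reference :)
  let $lt := fn:replace($amp, "&gt;&lt;&lt;", "&gt;&amp;lt;&lt;")
  (: restore the less-than sign missing from the root closing tag :)
  return fn:replace($lt, "/root>", "&lt;/root>")
};

let $decoded := xdmp:base64-decode("PHJvb3Q+PGNoaWxkPiY8L2NoaWxkPjxjaGlsZD48PC9jaGlsZD48Y2hpbGQ+PjwvY2hpbGQ+PGNoaWxkPns8L2NoaWxkPjxjaGlsZD59PC9jaGlsZD4vcm9vdD4=")
let $doc := xdmp:unquote(local:repair($decoded))
(: list the text content of each child element :)
return $doc/root/child/fn:string()
```

If the repair works as intended, this returns the five strings &, <, >, { and }, one per child element.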

However, this sort of repair is tedious, and it becomes harder as the variety of malformed input grows. It is probably best to use a tool such as TagSoup to repair the markup and turn it into well-formed XML.

Upvotes: 1
