Rachit Rampal

Reputation: 111

How to tidy up processing instructions in MarkLogic

I have content in my legacy database that is neither valid HTML nor valid XML. Since cleaning up the legacy data would be difficult, I want to tidy it in MarkLogic using xdmp:tidy. I am currently using MarkLogic 8.

<sub>
   <p>
      <???&dagger;?>
   </p>
</sub>

I'm passing this content to xdmp:tidy like this:

declare variable $xml as node() := 
       <content>
           <![CDATA[<p><???&dagger;?></p>]]>
       </content>;

xdmp:tidy(xdmp:quote($xml//text()),
   <options xmlns="xdmp:tidy">
    <assume-xml-procins>yes</assume-xml-procins>
    <quiet>yes</quiet>
    <tidy-mark>no</tidy-mark>
    <enclose-text>yes</enclose-text>
    <indent>yes</indent>
  </options>)

It returns:

<p>
<?  ?&dagger;?>
</p>

This result is not well-formed XML (I checked it with an XML validator), so when I try to insert it into MarkLogic it throws an error saying 'MALFORMED BODY | Invalid Processing Instruction names'.

I investigated PIs without much luck. I could have tried saving the content without the PI, but this is not even a valid PI.

Upvotes: 1

Views: 229

Answers (1)

prker

Reputation: 504

That is because what you think is a PI is in fact not a PI. From W3C:

2.6 Processing Instructions

[Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]

Processing Instructions

[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'

[17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
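As a rough way to see production [17] in practice, you can ask XQuery whether a candidate target is castable as xs:Name. This is only a sketch: the two sample strings are made up for illustration, and the cast does not apply the extra rule that excludes targets spelled "xml" in any case.

```xquery
xquery version "1.0-ml";

(: Hypothetical candidate targets: "??" mimics the start of the fake PI
   in the question, "xml-stylesheet" is a well-known real PI target. :)
for $target in ("??", "xml-stylesheet")
return fn:concat($target, ": ", $target castable as xs:Name)
```

The first string should report false and the second true, since a Name must begin with a letter, underscore, or colon.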

So the PI target cannot start with ?, as it does in your sample (??&dagger;). You probably want to clean up the content before you pass it to tidy, for example:

declare variable $xml as node() := 
   <content><![CDATA[<p>Hello <???&dagger;?>world</p>]]></content>;

declare function local:copy($input as item()*) as item()* {
  for $node in $input
   return 
     typeswitch($node)
     case text()
       return fn:replace($node,"<\?[^>]+\?>","")
     case element()
       return
          element {name($node)} {

            (: output each attribute in this element :)
            for $att in $node/@*
               return
                  attribute {name($att)} {$att}
            ,
            (: output all the sub-elements of this element recursively :)
            for $child in $node
               return local:copy($child/node())

          }
    (: otherwise pass it through.  Used for comments and PIs :)
    default return $node
};

xdmp:tidy(local:copy($xml),
  <options xmlns="xdmp:tidy">
    <assume-xml-procins>no</assume-xml-procins>
    <quiet>yes</quiet>
    <tidy-mark>no</tidy-mark>
    <enclose-text>yes</enclose-text>
    <indent>yes</indent>
  </options>)

This removes everything that looks like a PI, whether it is a real PI or a fake one.
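If you would rather keep genuine PIs and drop only the malformed ones, a narrower pattern is one option. The following is a minimal sketch, not a tested production fix: local:strip-fake-pis is a hypothetical helper name, the extra <?keep me?> PI is added to the sample purely for illustration, and the character class [^A-Za-z_:] is an ASCII-only approximation of "not a valid name start", so a real PI whose target begins with a non-ASCII letter would be stripped as well.

```xquery
xquery version "1.0-ml";

(: Sample input; the second PI, <?keep me?>, is added here for illustration. :)
declare variable $xml as node() :=
  <content><![CDATA[<p>Hello <???&dagger;?>world <?keep me?></p>]]></content>;

(: Hypothetical helper: removes <? ... ?> runs whose first character after
   "<?" cannot start an XML Name (ASCII approximation only). :)
declare function local:strip-fake-pis($s as xs:string) as xs:string {
  fn:replace($s, "<\?[^A-Za-z_:][^>]*\?>", "")
};

(: Returns the cleaned string; pass it to xdmp:tidy with your own options. :)
local:strip-fake-pis(fn:string($xml))
```

For the sample above this should leave <p>Hello world <?keep me?></p>, which you can then hand to xdmp:tidy with assume-xml-procins set to yes so that the remaining PI is treated as an XML PI.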

Regards,

Peter

Upvotes: 6
