Dave Cassel

Reputation: 8422

How to preserve the HTML5 doctype in MarkLogic?

We have a requirement to store and retrieve well-formed HTML5 documents in MarkLogic using the Java Client API or REST API.

Each document has an '.html' extension and the standard HTML5 doctype, <!DOCTYPE html>. When documents are inserted, they are stored as text documents by default.

We would like to use all the goodness that MarkLogic provides for search and manipulation of the documents as if they were XHTML, but we need to preserve the HTML5 doctype and .html extension for compatibility with other tools. I am sure we are not the only ones to have encountered this scenario.

We have tried changing the HTML mimetype to XML, but when documents are inserted the doctype gets replaced with the XML doctype. Is there a way to insert and retrieve well-formed HTML5 documents without losing the doctype?

Upvotes: 2

Views: 335

Answers (2)

ehennum

Reputation: 7335

Expanding a bit on WST's answer, you could store the document as XHTML and do the conversion in a REST API transform with

  • the xdmp:quote() function in an XQuery transform,
  • the xsl:output statement in an XSLT transform, or
  • the xdmp.quote() function in a JavaScript transform in MarkLogic 8.

A possible XQuery transform for the REST API:

xquery version "1.0-ml";
module namespace html5ifier =
    "http://marklogic.com/rest-api/transform/html5ifier";

declare default function namespace "http://www.w3.org/2005/xpath-functions";
declare option xdmp:mapping "false";

declare function html5ifier:transform(
    $context as map:map,
    $params  as map:map,
    $content as document-node() 
) as document-node()
{
    map:put($context,"output-type","text/html"),

    document{text{
        xdmp:quote($content,
            <options xmlns="xdmp:quote">
                <method>html</method>
                <media-type>text/html</media-type>
                <doctype-public>html</doctype-public>
            </options>)
        }}
};

If your REST server was on port 8011, you would install the transform with a PUT request:

http://localhost:8011/v1/config/transforms/html5ifier
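A minimal client-side sketch of that install request in Python. The helper name build_install_request and the placeholder transform source are hypothetical, and the application/xquery content type is an assumption about what the REST API expects for an XQuery transform. The request is only built here, not sent; actually sending it would also need credentials for a user allowed to change the REST configuration.

```python
import urllib.request


def build_install_request(host, port, name, xquery_source):
    """Build (but do not send) the PUT request that installs a transform."""
    url = f"http://{host}:{port}/v1/config/transforms/{name}"
    return urllib.request.Request(
        url,
        data=xquery_source.encode("utf-8"),
        method="PUT",
        # Assumption: XQuery transforms are uploaded as application/xquery.
        headers={"Content-Type": "application/xquery"},
    )


# Placeholder only; in practice this would be the full module shown above.
TRANSFORM_SOURCE = 'xquery version "1.0-ml"; (: html5ifier module body :)'

req = build_install_request("localhost", 8011, "html5ifier", TRANSFORM_SOURCE)
print(req.get_method(), req.full_url)
```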

Then you could GET the persisted XHTML document as HTML5 using the transform:

http://localhost:8011/v1/documents?uri=/path/to/the/doc.xhtml \
    &transform=html5ifier
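When building that URL in code, it is safer to let a library encode the query parameters. A small Python sketch, with a hypothetical helper name; note that the slashes in the document URI come out percent-encoded (%2F), which the server should decode back to the same URI.

```python
from urllib.parse import urlencode


def transformed_document_url(host, port, doc_uri, transform):
    """Return the /v1/documents URL that reads doc_uri through a transform."""
    query = urlencode({"uri": doc_uri, "transform": transform})
    return f"http://{host}:{port}/v1/documents?{query}"


url = transformed_document_url(
    "localhost", 8011, "/path/to/the/doc.xhtml", "html5ifier"
)
print(url)
```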

You could make additional changes to the XHTML document within the transform (either on the XML before quoting or on the string after quoting).

See also:

http://markmail.org/message/qmsos7np64ohyctp

Upvotes: 1

wst

Reputation: 11771

There is no native way to keep the doctype in the database (XQuery doesn't support doctypes). But with some logic you could add the doctype back when a document is requested.

For example:

declare function local:get-with-doctype(
    $document as document-node()
) as xs:string
{
    (: Serialize the document, prepending the HTML5 doctype
       only when the URI ends in .html :)
    if (ends-with(xdmp:node-uri($document), '.html'))
    then concat('<!DOCTYPE html>', xdmp:quote($document))
    else xdmp:quote($document)
};

Alternatively, you could parse the doctype out of the document when it's inserted and store it in a document property. Then when the document is requested, you could always add the one from the property. However, that would probably only be worth it if you were required to handle many doctypes.
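The split-and-restore idea can be sketched on the client side in Python, before the document is sent and after it comes back. The function names are hypothetical, the simple regular expression assumes the doctype is the first thing in the file and has no internal subset containing '>', and where the captured doctype is stored (for example, a document property) is left out.

```python
import re

# Matches a leading doctype declaration, case-insensitively.
# Assumption: no internal subset, so the first '>' ends the declaration.
DOCTYPE_RE = re.compile(r"\A\s*(<!DOCTYPE[^>]*>)\s*", re.IGNORECASE)


def split_doctype(html_text):
    """Return (doctype, body); doctype is None when none is present."""
    match = DOCTYPE_RE.match(html_text)
    if match is None:
        return None, html_text
    return match.group(1), html_text[match.end():]


def restore_doctype(doctype, body):
    """Put a previously captured doctype back in front of the body."""
    return body if doctype is None else doctype + "\n" + body


doctype, body = split_doctype("<!DOCTYPE html>\n<html><head/></html>")
print(doctype)
print(restore_doctype(doctype, body))
```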

Upvotes: 1
