Fan Li

Reputation: 1077

Load Text Files as Binary Using MarkLogic REST APIs

Is it possible to load a text file, regardless of its content, as a binary document through the MarkLogic REST APIs? More specifically, through a resource extension endpoint?

I see it is possible through the xdmp:document-load function, but I am not quite sure how to do it using the REST APIs.

xdmp:document-load("C:\my\path\test.txt",
    map:map() => map:with("uri", "/test/test.txt")
              => map:with("format", "binary")
)

I have tried to load the same document through the PUT /v1/documents API with the format parameter set to binary, but it was still loaded as a text document.

The use case is that I need to ingest a bunch of attachment files which occasionally include some text files. I don't need MarkLogic to index their content and in fact many of those files have encoding or format issues if MarkLogic attempts to do so.

Thank you!

Upvotes: 1

Views: 108

Answers (1)

Mads Hansen

Reputation: 66783

With /v1/documents PUT, the format parameter is used to indicate the format of the metadata, not the document.

As described in Controlling Input and Output Content Type, the content type of the document is determined in this order:

  • Primary: URI extension MIME type mapping, as long as the request does not specify a transform function.
  • Fallback: Content-type header MIME type mapping. For multipart input, the request Content-type header must be multipart/mixed, so the Content-type header for each part specifies the MIME type of the content for that part.

The resource file extension from the document URI is used to look for a configured Mimetype. It will use the format for the configured Mimetype, if there is a matching entry.

Unfortunately, the explicit Content-type header does not override the implicit format determination. So, if you want to load documents that have a .txt file extension as binary() documents, you will need to implement a workaround.
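The precedence described above can be modeled with a small JavaScript sketch. This is only an illustrative model, not MarkLogic's implementation: the names resolveFormat and extensionOf and both lookup tables are hypothetical, and a real server reads its mappings from the configured Mimetypes.

```javascript
// Hypothetical, heavily simplified lookup tables (assumptions for illustration;
// a real MarkLogic server uses its configured Mimetypes instead).
const FORMAT_BY_EXTENSION = new Map([
  ["txt", "text"],
  ["xml", "xml"],
  ["json", "json"],
  ["bin", "binary"],
]);
const FORMAT_BY_CONTENT_TYPE = new Map([
  ["text/plain", "text"],
  ["application/xml", "xml"],
  ["application/json", "json"],
  ["application/octet-stream", "binary"],
]);

// Returns the extension of the last path segment of a URI, or "" if there is none.
function extensionOf(uri) {
  const lastSegment = uri.slice(uri.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  return dot === -1 ? "" : lastSegment.slice(dot + 1).toLowerCase();
}

// Models the precedence: URI extension first (only when no transform is given),
// then the Content-type header. Returns undefined when nothing matches.
function resolveFormat({ uri, contentType, transform }) {
  const byExtension = FORMAT_BY_EXTENSION.get(extensionOf(uri));
  if (!transform && byExtension !== undefined) {
    return byExtension;
  }
  return FORMAT_BY_CONTENT_TYPE.get(contentType);
}

// Without a transform the .txt extension wins over the header: prints "text".
console.log(resolveFormat({ uri: "/test.txt", contentType: "application/octet-stream" }));
// With a transform the header is used instead: prints "binary".
console.log(resolveFormat({ uri: "/test.txt", contentType: "application/octet-stream", transform: "noop" }));
```

In this model, the only ways to get "binary" for a text file are to change the extension or to supply a transform, which is why workarounds of that kind are needed.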

In order to load the text documents as binary() with /v1/documents PUT you could:

  • Use a different file extension. Append ".bin" to the end of the text file URIs, e.g. /myTextFile.txt.bin. That may not be desirable, since it changes the document URIs from what they really are, but it does indicate that the text file is being stored as a binary document.
  • Apply a custom transform when loading the documents and specify the desired Content-type.
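The first workaround could be sketched on the client side as a small helper that rewrites URIs before the upload. This is a minimal sketch, assuming that ".bin" is mapped to a binary Mimetype on the server; the function name toBinaryUri and the list of text-like extensions are hypothetical.

```javascript
// Hypothetical set of extensions that would otherwise be stored as text
// (an assumption for illustration; adjust to the attachments being ingested).
const TEXT_LIKE_EXTENSIONS = new Set(["txt", "csv", "log"]);

// Appends ".bin" to URIs whose last path segment ends in a text-like extension.
// Other URIs are returned unchanged.
function toBinaryUri(uri) {
  const lastSegment = uri.slice(uri.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  const extension = dot === -1 ? "" : lastSegment.slice(dot + 1).toLowerCase();
  return TEXT_LIKE_EXTENSIONS.has(extension) ? `${uri}.bin` : uri;
}

console.log(toBinaryUri("/myTextFile.txt")); // prints "/myTextFile.txt.bin"
console.log(toBinaryUri("/photo.jpg"));      // prints "/photo.jpg"
```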

Here is an example of a passthrough transform. Applying it disables the implicit format detection based on the URI extension, so the explicit Content-type header is used instead:

// Passthrough transform: returns the content unchanged.
function noop(context, params, content) {
  return content;
}
exports.transform = noop;

Install the custom transform under the name noop. Below is an example curl command that does so. Update the username/password as appropriate:

curl --anyauth --user myUsername:myPassword -X PUT -i -d "function noop(context, params, content){return content;} exports.transform=noop" -H "Content-type: application/vnd.marklogic-javascript" http://localhost:8000/LATEST/config/transforms/noop

It is then possible to invoke /v1/documents PUT and specify a binary Mimetype as the Content-type (in this example, application/octet-stream):

curl --anyauth --user myUsername:myPassword -T ./test.txt -i -H "Content-type: application/octet-stream" "http://localhost:8000/v1/documents?uri=/test.txt&transform=noop"
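For clients that do not shell out to curl, the same request can be assembled in JavaScript. The sketch below only builds the method, URL, and headers and does not send anything. Authentication is deliberately left out, and the function name buildBinaryPutRequest is hypothetical. The host, port, document URI, and transform name are the ones used in the curl example.

```javascript
// Builds (but does not send) the pieces of a PUT /v1/documents request that
// names a transform and declares a binary Content-type.
function buildBinaryPutRequest(baseUrl, documentUri, transformName) {
  const url = new URL("/v1/documents", baseUrl);
  // searchParams takes care of percent-encoding the document URI.
  url.searchParams.set("uri", documentUri);
  url.searchParams.set("transform", transformName);
  return {
    method: "PUT",
    url: url.toString(),
    headers: { "Content-Type": "application/octet-stream" },
  };
}

const request = buildBinaryPutRequest("http://localhost:8000", "/test.txt", "noop");
console.log(request.method, request.url, request.headers);
```

The returned object can be handed to whichever HTTP client handles the server's authentication scheme, with the file contents as the request body.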

The document is then loaded as binary() instead of text(), which can be verified with:

doc("/test.txt")/node()/xdmp:node-kind(.)

yields: binary

Upvotes: 3
