Puneet Pant

Reputation: 948

How to upload docx, xlsx & txt files to MarkLogic Server?

I have a folder that contains doc, docx, xlsx, pdf, and txt files. I am uploading all of these files into MarkLogic with this XQuery:

(: Load every regular file in C:\uploads as a binary document under /documents/ :)
for $d in xdmp:filesystem-directory("C:\uploads")/dir:entry[dir:type eq "file"]
return
  xdmp:document-load(string($d/dir:pathname),
    <options xmlns="xdmp:document-load">
      <uri>{concat("/documents/", string($d/dir:filename))}</uri>
      <permissions>{xdmp:default-permissions()}</permissions>
      <collections>{xdmp:default-collections()}</collections>
      <format>binary</format>
    </options>)

I have also installed content processing for my database. When I upload doc and pdf files, they are converted to XML and XHTML files, but docx, xlsx, and txt files are not converted. Can somebody tell me why these files are not being converted?

Upvotes: 1

Views: 1216

Answers (1)

wpaven

Reputation: 409

Enable the Office OpenXML Extract pipeline to convert the .docx, .xlsx, and .pptx files.

Files with these extensions are already XML. If you were to change their extension to .zip, you could extract and see the files are just composed of interrelated XML parts.
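
As a rough illustration, you can list the parts inside an Office package without extracting it by hand. The sketch below assumes a .docx has already been loaded as a binary document; the URI is hypothetical.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI: assumes a .docx was already loaded as a binary document :)
let $docx := fn:doc("/documents/report.docx")
(: xdmp:zip-manifest returns one part element per file inside the zip package :)
for $part in xdmp:zip-manifest($docx)/*
return fn:string($part)
```

Each returned string is the path of one part in the package, which shows that the file is a zip archive of XML parts.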

The Office OpenXML Extract pipeline unzips Office 2007/2010 files and stores their component parts in a directory that is a sibling of the main file, similar to the other conversion pipelines. This pipeline lets you store the raw Open XML. There is no further conversion to XHTML or DocBook at this time.
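
The pipeline can be attached through the Admin interface, or possibly with a script along these lines. This is only a sketch: it assumes the pipeline is already loaded in the triggers database, that the query is evaluated against that triggers database, and that the CPF domain is named "Default Documents" (an assumed name; substitute your own).

```xquery
xquery version "1.0-ml";

import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";
import module namespace p = "http://marklogic.com/cpf/pipelines"
  at "/MarkLogic/cpf/pipelines.xqy";

(: Evaluate against the triggers database used by the content database. :)
(: "Default Documents" is an assumed domain name, not taken from the question. :)
dom:add-pipeline(
  "Default Documents",
  xs:unsignedLong(p:get("Office OpenXML Extract")/p:pipeline-id)
)
```

Documents loaded after the pipeline is attached should then be picked up by it. Files that were loaded earlier may need to be reloaded.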

There is no conversion for .txt that I'm aware of. Those are just text files and will be inserted as text in MarkLogic. You could convert to XML by simply wrapping the text in a parent element and changing the file extension to .xml.
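
A minimal sketch of that wrapping step, assuming a UTF-8 encoded text file. The file path, the target URI, and the `doc` element name are all hypothetical.

```xquery
xquery version "1.0-ml";

(: Hypothetical path and URI, for illustration only :)
let $path := "C:\uploads\notes.txt"
(: xdmp:filesystem-file reads the file as a string; it expects UTF-8 content :)
let $text := xdmp:filesystem-file($path)
return
  (: Wrap the text in a parent element and store it under an .xml URI :)
  xdmp:document-insert("/documents/notes.xml", <doc>{$text}</doc>)
```

The element constructor escapes characters such as `<` and `&` in the text, so the stored document stays well-formed.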

Hope this helps.

Upvotes: 6
