Reputation: 41
I have been trying to convert my docx files to a custom XML format I have designed. My users want their data converted to this XML so that content is easier to query in their web app, and they want the input to come from their docx files.
I have looked for a converter API in Java, but none seems to fit my requirements. I looked into docx4j, but as far as I can tell it only converts to HTML and PDF. I am wondering whether there is a converter API to which I can supply, say, an intermediate translator (XSLT), and which would output my custom XML complete with the data from my docx.
Is there an existing tool for this? If not, any suggestions on the approach I should take in coding my own converter? For example, starting from OpenXML, should I convert to XSL-FO first and then to the custom XML?
Would love to hear from the community.
Thank you very much.
Upvotes: 4
Views: 6700
Reputation: 15863
docx4j can be used to convert OpenXML to arbitrary XML via XSLT.
Assuming Templates xslt and javax.xml.transform.stream.StreamResult result, you'd do something like this:
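import org.docx4j.XmlUtils;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;

// Load the docx and get its main document part
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();
// DOM document to input to transform
org.w3c.dom.Document doc = XmlUtils.marshaltoW3CDomDocument(mdp.getJaxbElement());
// Apply the stylesheet; the third argument is the map of transform parameters (none here)
XmlUtils.transform(doc, xslt, null, result);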
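The xslt and result objects assumed above can be built with the standard javax.xml.transform API. The following is only a minimal sketch: the class name XsltSetup, the compile helper, and the stylesheet are illustrative assumptions, not part of docx4j. The stylesheet simply wraps the text of each w:p paragraph in a hypothetical <para> element; a real one would map to your own custom XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSetup {
    // Hypothetical stylesheet: emits <doc> with one <para> per WordprocessingML paragraph.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<doc><xsl:for-each select='//w:p'>"
      + "<para><xsl:value-of select='.'/></para>"
      + "</xsl:for-each></doc>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Compile a stylesheet once; the resulting Templates object can be reused.
    static Templates compile(String stylesheet) throws Exception {
        return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheet)));
    }

    public static void main(String[] args) throws Exception {
        Templates xslt = compile(XSLT);
        StringWriter out = new StringWriter();
        StreamResult result = new StreamResult(out);

        // With docx4j you would now pass xslt and result to XmlUtils.transform.
        // Here the stylesheet is applied to a tiny hand-written sample instead,
        // so the sketch runs without docx4j on the classpath.
        String sample =
            "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>"
          + "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>";
        xslt.newTransformer().transform(new StreamSource(new StringReader(sample)), result);
        System.out.println(out);
    }
}
```

Compiling to a Templates object rather than creating a Transformer directly is worthwhile if you convert many documents with the same stylesheet, since the compiled form can be reused.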
However, if all you want to do is transform to XML, then docx4j (and Apache POI, for that matter) is overkill. You could just use OpenXML4J directly.
Whether conversion via XSLT is the best approach, though, depends on whether your target XML is document-oriented or data-oriented.
If it is document-oriented, XSLT is a good approach.
If it is data-oriented, you might want to consider content control data-binding. (There was another approach, called custom XML markup, but the i4i patent farce may make that approach inadvisable if you are relying on Word for editing.)
Upvotes: 3
Reputation: 887
I've had the most luck saving docx as HTML right from Word. The HTML is not XHTML, so you'd need to run Tidy on it. Otherwise, it works fairly well if you must use a Word-based workflow. You can also write a VBA script to have Word open a file and save it as HTML programmatically.
Upvotes: 0
Reputation: 14006
To the best of my knowledge, docx files are simply XML files in a ZIP container. To convert one to an XML format of your own design, you would need to unzip the file (into a new folder or into memory), load the XML document you want to convert (the main body is normally word/document.xml), and apply your XSLT to it. You don't mention your development environment, apart from the "docx4j" tag. Are you developing in Java? If so, I'm afraid I don't know which libraries to point you to for ZIP handling and XML transformation (although I know they exist, and a 5-minute Google search should find them).
To check out the xml files in a docx, simply change the extension of the file from ".docx" to ".zip" and open in your favorite ZIP archive tool.
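For Java, the unzip-and-transform steps described above can be done with the JDK alone (java.util.zip and javax.xml.transform). This is a rough sketch, not a tested converter: the class name DocxUnzipTransform and the file names in main are made up for illustration, and it assumes the main document part sits at word/document.xml, which is where Word normally writes it.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DocxUnzipTransform {
    // Reads the main document part straight out of the ZIP container
    // and applies the given XSLT stylesheet, writing the result to out.
    static void transform(File docx, File stylesheet, OutputStream out) throws Exception {
        try (ZipFile zip = new ZipFile(docx)) {
            // Assumption: the main part is at the usual location.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Transformer transformer = TransformerFactory.newInstance()
                        .newTransformer(new StreamSource(stylesheet));
                transformer.transform(new StreamSource(in), new StreamResult(out));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java DocxUnzipTransform input.docx custom.xsl
        transform(new File(args[0]), new File(args[1]), System.out);
    }
}
```

Nothing is extracted to disk here; the entry is streamed from the archive into the transformer, which covers the "into memory" variant of the approach.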
Upvotes: 1