Santiago Cometto

Reputation: 1

How to deal with conversion of non-consistent DOCX files to PDF?

I'm struggling with the conversion of DOCX files to PDF.

In short, I'm selecting DOCX files from a database (as bytes), merging them with Apache POI, and then trying to convert the result to PDF using docx4j, all in memory (in production I can't work with files on local storage; the local file below is just for testing).

The problem is that the DOCX files from the database aren't consistent; some metadata or properties may be missing.

This is the method that merges the DOCX files into a single DOCX file (manipulated as XML in memory):

public XWPFDocument mergeDocx(List<String> docxNames) throws Exception {
    List<FileData> fileData = repository.getDocxs(docxNames);
    ZipSecureFile.setMinInflateRatio(0);

    InputStream inputS = new ByteArrayInputStream(fileData.get(0).getData());
    OPCPackage opcPackage = OPCPackage.open(inputS);
    XWPFDocument xwpfDocument = new XWPFDocument(opcPackage);

    fileData.remove(0);

    if (!fileData.isEmpty()) {
        for (FileData fd : fileData) {
            inputS = new ByteArrayInputStream(fd.getData());
            opcPackage = OPCPackage.open(inputS);
            XWPFDocument xwpf = new XWPFDocument(opcPackage);

            CTBody bodyToAppend = xwpf.getDocument().getBody();
            xwpfDocument.getDocument().addNewBody().set(bodyToAppend);
        }
    }
    inputS.close();
    opcPackage.close();
    return xwpfDocument;
}

Both the final merged DOCX file and the individual ones selected from the DB are "broken" and don't work properly in the second method. To make it work, I have to write the merged file to local disk and pass it through a DOCX converter first.

public void toPdf(XWPFDocument docxDocument) throws Exception {
    //in
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    docxDocument.write(baos);
    docxDocument.close();
    byte[] bytes = baos.toByteArray(); //this is basically a ByteArray of an XML file, not a consistent DOCX one
    WordprocessingMLPackage ml = Docx4J.load(new ByteArrayInputStream(bytes));

    //out
    OutputStream output = new FileOutputStream("/Users/Santiago/Documents/test.pdf");
    Docx4J.toPDF(ml, output);
    output.flush();
    output.close();

}

The question is: is there any way to get a consistent DOCX file (maybe by adding some properties or applying some formatting) before going through the second method, without resorting to external tools like the web app I'm using to convert my "bad" DOCX file into a consistent one?

Upvotes: 0

Views: 247

Answers (1)

JasonPlutext

Reputation: 15878

Your merge code doesn't do what you think it does.

If all you need is PDF output, then create the PDFs first, then merge them using PdfBox.
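A minimal sketch of the PDF-side merge, assuming PDFBox 2.x (where PDFMergerUtility accepts InputStream sources; PDFBox 3.x changed these signatures). The class name PdfFirstMerge and the method mergePdfs are hypothetical. Each element of the input list is assumed to be the PDF bytes of one document, produced by pointing Docx4J.toPDF at a ByteArrayOutputStream instead of a FileOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

// Hypothetical helper class; assumes PDFBox 2.x on the classpath.
final class PdfFirstMerge {

    /** Concatenates already-converted PDFs (one byte[] per document) in memory. */
    public static byte[] mergePdfs(List<byte[]> pdfs) throws IOException {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (byte[] pdf : pdfs) {
            // Sources are appended in list order.
            merger.addSource(new ByteArrayInputStream(pdf));
        }
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        merger.setDestinationStream(merged);
        // Keep everything in main memory, no temp files on local storage.
        merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
        return merged.toByteArray();
    }
}
```

Because each DOCX is converted on its own, styles, numbering and images never have to be reconciled across documents; the trade-off is that page numbering and headers are not continuous across the merged PDF.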

If you want to do other stuff with your merged docx, then you could use the commercial Docx4j Enterprise to do the merge.
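Whichever route you take, it can help to check what the in-memory bytes actually are before handing them to Docx4J.load. A DOCX is a ZIP package, and XWPFDocument.write emits that package, not a bare XML file. The sketch below uses only the JDK; DocxPackageCheck and its methods are hypothetical names, and word/document.xml is assumed as the conventional location of the main document part. It checks the container only, so a package can pass this check and still have an invalid body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical diagnostic helper; not part of POI or docx4j.
final class DocxPackageCheck {

    /** Lists the entry names of a ZIP held in memory (empty if the bytes are not a ZIP). */
    public static Set<String> listParts(byte[] bytes) throws IOException {
        Set<String> names = new LinkedHashSet<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(bytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** True if the content-types part and the conventional main document part are present. */
    public static boolean looksLikeDocx(byte[] bytes) throws IOException {
        Set<String> parts = listParts(bytes);
        return parts.contains("[Content_Types].xml")
                && parts.contains("word/document.xml");
    }

    /** Builds a tiny ZIP with placeholder content, only for demonstrating the check. */
    public static byte[] zipOf(String... partNames) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : partNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = zipOf("[Content_Types].xml", "word/document.xml");
        System.out.println(listParts(sample));
        System.out.println(looksLikeDocx(sample));
    }
}
```

In the question's toPdf method, calling DocxPackageCheck.listParts(bytes) right after baos.toByteArray() would show which parts the merged package really contains.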

Upvotes: 1
