How to fix PDF/A metadata set by PDFBox (working with Docx4j and XDocReport)

Question

To reach the PDF/A-1a conformance level, I am setting XMP metadata on a PDF using PDFBox v2.0.13. Before setting the metadata, I convert the file from .docx to PDF. I have tried two converters: XDocReport v2.0.1 and Docx4j v6.1.0.

In the Java class I have the following code:

PDDocumentInformation info = pdf.getDocumentInformation();
info.setTitle("Apache PDFBox");
info.setSubject("Apache PDFBox adding meta-data to PDF document");
info.setCreator("MyCreator");
...
DublinCoreSchema dcSchema = metadata.createAndAddDublinCoreSchema();
dcSchema.setTitle(info.getTitle());
dcSchema.setDescription(info.getSubject());
dcSchema.addCreator(info.getCreator());

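When I convert with XDocReport, I get the following metadata:

  [XMP packet; the RDF/XML markup is not preserved here, only the element values]
  title:       Apache PDFBox
  description: Apache PDFBox adding meta-data to PDF document
  creator:     MyCreator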

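When I convert with Docx4j, by contrast, I get the following metadata:

  [XMP packet; the RDF/XML markup is not preserved here, only the element values]
  title:       Apache PDFBox
  description: Apache PDFBox adding meta-data to PDF document
  creator:     MyCreator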

Because the metadata written for "title" and "description" differs between the two, the final PDF produced with XDocReport passes the PDF/A-1a check, while the one produced with Docx4j does not.

The check is performed with veraPDF.

Since Docx4j produces a more readable PDF, is there a way to fix the metadata in the final PDF?

Tilman Hausherr · Accepted Answer

This is a known problem when xmpbox is used together with certain other libraries, e.g. FOP.

The transformer is the problem.

This code in XmpSerializer.java:

Transformer transformer = TransformerFactory.newInstance().newTransformer();

should return an instance of com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl. Check which class you actually get on your classpath.
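
A quick way to check this is to print the class names that the JAXP lookup resolves to. The sketch below is only an illustration: TransformerCheck and its methods are hypothetical names, not part of PDFBox or xmpbox. Run it with the same classpath as the conversion code, otherwise the result says nothing about your application.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper class for diagnosis only; not part of PDFBox or xmpbox.
class TransformerCheck {

    /** Class name of the factory that the JAXP lookup currently selects. */
    static String factoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Class name of the transformer created by the same call quoted from XmpSerializer. */
    static String transformerClassName() {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            return transformer.getClass().getName();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Could not create a transformer", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("factory:     " + factoryClassName());
        System.out.println("transformer: " + transformerClassName());
    }
}
```

If the transformer line shows anything other than the com.sun.org.apache.xalan.internal.xsltc.trax class, another jar on the classpath has replaced the JDK implementation.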

javadoc: https://docs.oracle.com/javase/7/docs/api/javax/xml/transform/TransformerFactory.html#newInstance()

"The Services API will look for a classname in the file META-INF/services/javax.xml.transform.TransformerFactory in jars available to the runtime."

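To find out which jar registers its own factory through that services file, you can list every copy of the file visible to the class loader. This is a minimal sketch; TransformerServiceScan is a hypothetical name, and the scan only sees classpath registrations, not providers declared by Java modules.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic class; lists classpath entries that register a TransformerFactory.
class TransformerServiceScan {

    static final String SERVICE_FILE =
            "META-INF/services/javax.xml.transform.TransformerFactory";

    /** Returns the URL of every services file for TransformerFactory on the classpath. */
    static List<URL> findRegistrations() {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(SERVICE_FILE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> registrations = findRegistrations();
        if (registrations.isEmpty()) {
            System.out.println("No classpath registrations found.");
        }
        for (URL url : registrations) {
            // Each URL points into the jar that ships the services file.
            System.out.println(url);
        }
    }
}
```
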
You can force the default implementation by setting a system property:

System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");
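
If setting the property for the whole application is too broad, one option is to set it only around the code that serializes the XMP and restore the previous value afterwards. The following is a sketch under assumptions: ScopedTransformerFactory and withJdkFactory are hypothetical names, and the JDK in use must bundle the com.sun.org.apache.xalan.internal implementation (as Oracle JDK and OpenJDK do).

```java
import java.util.function.Supplier;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper; not part of PDFBox, xmpbox or Docx4j.
class ScopedTransformerFactory {

    static final String KEY = "javax.xml.transform.TransformerFactory";
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    /** Runs the action with the JDK factory forced, then restores the old property value. */
    static <T> T withJdkFactory(Supplier<T> action) {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, JDK_FACTORY);
        try {
            return action.get();
        } finally {
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        // In real code the action would be the XMP serialization step.
        String name = withJdkFactory(
                () -> TransformerFactory.newInstance().getClass().getName());
        System.out.println(name);
    }
}
```

System properties are global to the JVM, so any other thread that calls TransformerFactory.newInstance() while the action runs also gets the JDK factory. This narrows the window but does not isolate the change.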

However, this might break something in the other library.

A different solution would be to copy the source code of XmpSerializer, and to change the newInstance call like this:

Transformer transformer = TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null).newTransformer();
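
To see that the two-argument call bypasses whatever the classpath registers, it can be tried in isolation before touching a copy of XmpSerializer. This is a minimal sketch; JdkTransformers is a hypothetical name, and it again assumes a JDK that ships the com.sun.org.apache.xalan.internal implementation.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper showing the explicit-factory call on its own.
class JdkTransformers {

    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    /** Creates a transformer from the named factory, ignoring services files and system properties. */
    static Transformer newJdkTransformer() {
        try {
            // A null class loader means the thread context class loader is used.
            return TransformerFactory.newInstance(JDK_FACTORY, null).newTransformer();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Could not create a transformer", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(newJdkTransformer().getClass().getName());
    }
}
```

On Java 9 and later, TransformerFactory.newDefaultInstance() returns the system-default implementation without naming an internal class, which may be a cleaner choice if you do not need to support Java 8.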
