Cesare

Reputation: 3

How to fix PDF/A metadata set by PDFBox (working with Docx4j and XDocReport)

To reach PDF/A-1A conformance, I am setting XMP metadata on a PDF using PDFBox v2.0.13. Before setting the metadata, I convert the file from .docx to PDF. I have tried two ways of converting: one using XDocReport v2.0.1 and the other using Docx4j v6.1.0.

In the Java class I have the following code:

PDDocumentInformation info = pdf.getDocumentInformation();
info.setTitle("Apache PDFBox");
info.setSubject("Apache PDFBox adding meta-data to PDF document");
info.setCreator("MyCreator");
...
DublinCoreSchema dcSchema = metadata.createAndAddDublinCoreSchema();
dcSchema.setTitle(info.getTitle());
dcSchema.setDescription(info.getSubject());
dcSchema.addCreator(info.getCreator());

When I convert with XDocReport, I get the following metadata:

  </rdf:Description>
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Apache PDFBox</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:description>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
        </rdf:Alt>
      </dc:description>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>MyCreator</rdf:li>
        </rdf:Seq>
      </dc:creator>
   </rdf:Description>

When I convert with Docx4j instead, I get the following metadata:

    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li lang="x-default">Apache PDFBox</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:description>
        <rdf:Alt>
          <rdf:li lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
        </rdf:Alt>
      </dc:description>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>MyCreator</rdf:li>
        </rdf:Seq>
      </dc:creator>
    </rdf:Description>

Because the metadata for "title" and "description" differs (xml:lang on rdf:li with XDocReport, a bare lang attribute with Docx4j), the final PDF produced with XDocReport is PDF/A-1A compliant, while the one produced with Docx4j is not.

The accessibility check is made using VeraPDF.
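
For a quick check that does not need a full VeraPDF run, the difference between the two packets can be detected with a namespace-aware DOM parse. This is only a minimal sketch: the class name XmpLangCheck and the helper sample are hypothetical, the sample packet is a cut-down version of the metadata shown above, and the check covers only the xml:lang attribute, not PDF/A conformance as a whole.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PDFBox, xmpbox or VeraPDF.
class XmpLangCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns true if every rdf:li inside an rdf:Alt carries an xml:lang attribute. */
    static boolean altItemsHaveXmlLang(String xmp) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed to tell xml:lang apart from a bare lang
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
        NodeList alts = doc.getElementsByTagNameNS(RDF_NS, "Alt");
        for (int i = 0; i < alts.getLength(); i++) {
            NodeList items = ((Element) alts.item(i)).getElementsByTagNameNS(RDF_NS, "li");
            for (int j = 0; j < items.getLength(); j++) {
                Element li = (Element) items.item(j);
                if (!li.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Builds a cut-down packet; langAttr is either "xml:lang" or "lang". */
    static String sample(String langAttr) {
        return "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description xmlns:dc=\"http://purl.org/dc/elements/1.1/\" rdf:about=\"\">"
                + "<dc:title><rdf:Alt><rdf:li " + langAttr
                + "=\"x-default\">Apache PDFBox</rdf:li></rdf:Alt></dc:title>"
                + "</rdf:Description></rdf:RDF>";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(altItemsHaveXmlLang(sample("xml:lang"))); // XDocReport-style packet
        System.out.println(altItemsHaveXmlLang(sample("lang")));     // Docx4j-style packet
    }
}
```

The first call reports true and the second false, which matches the difference visible in the two metadata dumps.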

Since Docx4j produces a more readable PDF, is there a way to fix the metadata in the final PDF?

Upvotes: 0

Views: 2287

Answers (2)

Tilman Hausherr

Reputation: 18851

This is a known problem when xmpbox is used together with certain other libraries, e.g. FOP.

The transformer is the problem.

This code in XmpSerializer.java:

Transformer transformer = TransformerFactory.newInstance().newTransformer();

should return an instance of the com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl class. (Try it.)
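
One way to try it is to print the class that JAXP resolves in your application. This is a minimal sketch; the class name TransformerCheck is made up for the example, and it has to run with the same classpath as your conversion code (Docx4j and its dependencies included) for the result to mean anything.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;

// Hypothetical diagnostic class; run it on the same classpath as the conversion.
class TransformerCheck {
    /** Returns the class name of the Transformer that JAXP currently resolves. */
    static String transformerClassName() throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        return transformer.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transformerClassName());
    }
}
```

If the printed name is not the com.sun.org.apache.xalan.internal one, another jar on the classpath has registered its own TransformerFactory.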

javadoc: https://docs.oracle.com/javase/7/docs/api/javax/xml/transform/TransformerFactory.html#newInstance()

"The Services API will look for a classname in the file META-INF/services/javax.xml.transform.TransformerFactory in jars available to the runtime."

You can force the default implementation by setting a system property:

System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");

However, this might break something in the other library.
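
To limit that risk, the property could be set only around the XMP serialization and restored afterwards. The following is a sketch under assumptions, not something xmpbox provides: JdkTransformerScope and withJdkTransformer are hypothetical names, and because system properties are global, it is not safe if another thread creates a TransformerFactory at the same time.

```java
import java.util.concurrent.Callable;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper: forces the JDK's built-in TransformerFactory for one action only.
class JdkTransformerScope {
    static final String KEY = "javax.xml.transform.TransformerFactory";
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    static <T> T withJdkTransformer(Callable<T> action) throws Exception {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, JDK_FACTORY);
        try {
            return action.call();
        } finally {
            // Restore the earlier state so the other library keeps its own transformer.
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String name = withJdkTransformer(
                () -> TransformerFactory.newInstance().getClass().getName());
        System.out.println(name);
    }
}
```

In the real application, the Callable would wrap the call that serializes the XMP metadata, so that only this one step uses the JDK transformer.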

A different solution would be to copy the source code of XmpSerializer, and to change the newInstance call like this:

Transformer transformer = TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null).newTransformer();


Upvotes: 0

JasonPlutext

Reputation: 15863

docx4j's export-FO uses Apache FOP (v2.3) to create a PDF.

So export-FO has the same ability to make PDF/A-1A as FOP v2.3: https://xmlgraphics.apache.org/fop/2.3/pdfa.html

So I tried:

    FOUserAgent foUserAgent = FORendererApacheFOP.getFOUserAgent(foSettings);       
    foUserAgent.getRendererOptions().put("pdf-a-mode", "PDF/A-1b");     
    // nb PDF/A-1a, PDF/A-2a and PDF/A-3a require accessibility to be enabled

But it complained:

For PDF/A-1b, all fonts, even the base 14 fonts, have to be embedded! Offending font: /Times-Roman
org.apache.fop.pdf.PDFConformanceException: For PDF/A-1b, all fonts, even the base 14 fonts, have to be embedded! Offending font: /Times-Roman
    at org.apache.fop.pdf.PDFFont.validate(PDFFont.java:170)

So you'd need to look into embedding the base 14 fonts.

As a side note, I tried PDFBox's ExtractMetadata sample on a simple PDF created using export-FO. Unfortunately, it reported:

An error ouccred when parsing the meta data: Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]

As an alternative to all of this, you could consider our commercial PDF Converter. That can produce PDF/A-2b: https://converter-eval.plutext.com/pdf_archive.html

Upvotes: 1
