Reputation: 3
To reach the accessibility level PDF/A-1A, I am setting XMP metadata on a PDF using PDFBox v2.0.13. Before setting the metadata, I convert the file from .docx to PDF. I have tried two ways of doing the conversion: one using XDocReport v2.0.1 and the other using Docx4j v6.1.0.
In the Java class I have the following code:
PDDocumentInformation info = pdf.getDocumentInformation();
info.setTitle("Apache PDFBox");
info.setSubject("Apache PDFBox adding meta-data to PDF document");
info.setCreator("MyCreator");
...
DublinCoreSchema dcSchema = metadata.createAndAddDublinCoreSchema();
dcSchema.setTitle(info.getTitle());
dcSchema.setDescription(info.getSubject());
dcSchema.addCreator(info.getCreator());
When I do the conversion with XDocReport, I get the following metadata:
</rdf:Description>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Apache PDFBox</rdf:li>
</rdf:Alt>
</dc:title>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li>MyCreator</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
When I do the conversion with Docx4j instead, I get the following metadata:
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:title>
<rdf:Alt>
<rdf:li lang="x-default">Apache PDFBox</rdf:li>
</rdf:Alt>
</dc:title>
<dc:description>
<rdf:Alt>
<rdf:li lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li>MyCreator</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
Because of the difference in the metadata produced for "title" and "description" (the rdf:li elements carry a plain lang attribute in the Docx4j case, where the XDocReport case has xml:lang), the final PDF produced using XDocReport passes as PDF/A-1A, while the one produced using Docx4j does not.
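For a quick check outside the validator, the difference between the two outputs can be detected with a namespace-aware DOM parse. This is only a minimal sketch using the JDK's own XML APIs; the class and method names (LangAttributeCheck, countAltItemsMissingXmlLang) are made up for illustration and are not part of PDFBox or any other library mentioned here.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PDFBox: counts rdf:li items inside rdf:Alt
// that do not carry an attribute "lang" in the XML namespace (i.e. xml:lang).
class LangAttributeCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static int countAltItemsMissingXmlLang(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required so xml:lang and lang are distinguished
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList alts = doc.getElementsByTagNameNS(RDF_NS, "Alt");
            int missing = 0;
            for (int i = 0; i < alts.getLength(); i++) {
                NodeList items = ((Element) alts.item(i)).getElementsByTagNameNS(RDF_NS, "li");
                for (int j = 0; j < items.getLength(); j++) {
                    Element li = (Element) items.item(j);
                    if (!li.hasAttributeNS(XMLConstants.XML_NS_URI, "lang")) {
                        missing++;
                    }
                }
            }
            return missing;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XMP fragment", e);
        }
    }
}
```

The input must be a well-formed XML string that declares the rdf prefix, for example the complete XMP packet extracted from the PDF rather than the excerpt shown above.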
The accessibility check is made using VeraPDF.
Since Docx4j produces a more readable PDF, is there a way to fix the metadata in the final pdf?
Upvotes: 0
Views: 2287
Reputation: 18851
This is a known problem when xmpbox is used together with certain other libraries, e.g. FOP.
The transformer is the problem.
This code in XmpSerializer.java:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
should return an instance of the com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl class. (Try it.)
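One way to try it is to print the class that JAXP actually resolves on your classpath. A minimal sketch follows; the class name TransformerCheck is just for illustration. Run it with the same classpath as your application, otherwise the result says nothing about what xmpbox will get.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;

// Illustrative helper: reports which Transformer implementation the
// standard JAXP lookup picks up in the current runtime.
class TransformerCheck {
    static String resolvedTransformerClass() {
        try {
            // Same call as in XmpSerializer: no explicit factory class is named,
            // so the system property / META-INF/services lookup decides.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            return transformer.getClass().getName();
        } catch (Exception e) {
            throw new IllegalStateException("No usable TransformerFactory found", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolvedTransformerClass());
    }
}
```

If the printed name is not the com.sun.org.apache.xalan.internal one, another library on the classpath has registered its own factory.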
javadoc: https://docs.oracle.com/javase/7/docs/api/javax/xml/transform/TransformerFactory.html#newInstance()
"The Services API will look for a classname in the file META-INF/services/javax.xml.transform.TransformerFactory in jars available to the runtime."
You can force the default implementation by setting a system property:
System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");
However, this may break something in the other library, because the property applies to the whole JVM.
A different solution would be to copy the source code of XmpSerializer, and to change the newInstance call like this:
Transformer transformer = TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null).newTransformer();
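To see the explicit-factory call in isolation, here is a small sketch that serializes a one-element DOM with it. It is not the XmpSerializer code and does not replace it; the class and method names (ExplicitTransformerDemo, serializeLangAlternative) are invented for the example, and it assumes a JDK that ships the internal Xalan XSLTC implementation under this class name.

```java
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative only: builds <rdf:li xml:lang="x-default">text</rdf:li> and
// serializes it with the JDK's built-in factory, named explicitly so that
// factories registered by other jars are bypassed.
class ExplicitTransformerDemo {
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static String serializeLangAlternative(String text) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();

            Element li = doc.createElementNS(RDF_NS, "rdf:li");
            li.setAttributeNS(XMLConstants.XML_NS_URI, "xml:lang", "x-default");
            li.setTextContent(text);
            doc.appendChild(li);

            // Second argument null: use the default class loader lookup.
            Transformer transformer =
                    TransformerFactory.newInstance(JDK_FACTORY, null).newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }
}
```

Unlike the system property, this affects only the code that makes the call, so the other library keeps whatever transformer it expects.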
Upvotes: 0
Reputation: 15863
docx4j's export-FO uses Apache FOP (v2.3) to create a PDF.
So export-FO has the same ability to make PDF/A-1A as FOP v2.3: https://xmlgraphics.apache.org/fop/2.3/pdfa.html
So I tried:
FOUserAgent foUserAgent = FORendererApacheFOP.getFOUserAgent(foSettings);
foUserAgent.getRendererOptions().put("pdf-a-mode", "PDF/A-1b");
// nb PDF/A-1a, PDF/A-2a and PDF/A-3a require accessibility to be enabled
But it complained:
For PDF/A-1b, all fonts, even the base 14 fonts, have to be embedded! Offending font: /Times-Roman
org.apache.fop.pdf.PDFConformanceException: For PDF/A-1b, all fonts, even the base 14 fonts, have to be embedded! Offending font: /Times-Roman
at org.apache.fop.pdf.PDFFont.validate(PDFFont.java:170)
So you'd need to look into embedding the base 14 fonts.
As a side note, I tried PDFBox's ExtractMetadata sample on a simple PDF created using export-FO. Unfortunately, it reported:
An error ouccred when parsing the meta data: Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
As an alternative to all of this, you could consider our commercial PDF Converter. That can produce PDF/A-2b: https://converter-eval.plutext.com/pdf_archive.html
Upvotes: 1