Skary
Skary

Reputation: 417

PDFBOX: Merge adds unused Fonts, how to remove it

i merge two PDF Files into one with PDFBOX Version 2. The First one got Fonts:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
XXMGEM+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes     15  0
XXMGEM+ArialMT                       TrueType          WinAnsi          yes yes yes     19  0
XXMGEM+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
XXMGEM+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes     40  0
XXMGEM+ArialNarrow                   TrueType          WinAnsi          yes yes yes     44  0

and the Second one:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
UNTWVR+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     25  0
UNTYID+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     26  0
UNTZUP+ArialMT                       CID TrueType      Identity-H       yes yes yes     27  0
UNUBHB+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     28  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no      29  0
UNXPUH+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes     50  0
UNXRGT+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes     51  0
UNXSTF+ArialMT                       CID TrueType      Identity-H       yes yes yes     52  0
UNXUFR+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes     53  0

After Merging, this happens:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    420  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    421  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    422  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    423  0
Helvetica-Bold                       Type 1            WinAnsi          no  no  no     424  0
SRWYVL+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    425  0
SRXAHX+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    426  0
SRXBUJ+ArialMT                       CID TrueType      Identity-H       yes yes yes    427  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    428  0
SRWYVL+ArialMT                       CID TrueType      Identity-H       yes yes yes    429  0
SRXAHX+HelveticaLTCom-Roman          CID TrueType      Identity-H       yes yes yes    430  0
SRXBUJ+HelveticaLTCom-Bold           CID TrueType      Identity-H       yes yes yes    431  0
SRXDGV+Arial-BoldMT                  CID TrueType      Identity-H       yes yes yes    432  0
WDEGAT+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes    436  0
GSEDXU+ArialMT                       TrueType          WinAnsi          yes yes yes    437  0
Arial                                TrueType          WinAnsi          yes no  no     416  0
ZapfDingbats                         TrueType          WinAnsi          yes no  yes    419  0
ArialNarrow                          TrueType          WinAnsi          yes no  no     417  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    618  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    619  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    620  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    621  0
ACHRDX+ZapfDingbats                  TrueType          WinAnsi          yes yes yes    622  0
GSEDXU+ArialNarrow-Bold              TrueType          WinAnsi          yes yes yes    560  0
NVGLHQ+ArialNarrow                   TrueType          WinAnsi          yes yes yes    561  0
KWHHMM+ArialMT                       CID TrueType      Identity-H       yes yes yes    578  0

My Code in Java:

final PDFMergerUtility pdfMerger = new PDFMergerUtility();
            pdfMerger.setDestinationStream(outputStream);
            pdfMerger.addSources(additionalPdfStreams);
            pdfMerger.addSource(inputStreamPdDocument);
            pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

The Problem is that an Api from a third party vendor got an Problem with this Fonts. So : What am i doing wrong and how can i remove the unused and doubled fonts ??

Upvotes: 9

Views: 3196

Answers (2)

Jin Thakur
Jin Thakur

Reputation: 2773

The Merge Program is giving its own replacement fonts as it cant find or use the existing fonts embedded in Pdf. You have to rewrite code and use the specific font . extract all fonts from pdf Then merge them make sure you use the same font.

Try using system fonts that should not cause any issue to third party Custom fonts are always an issue. Some API does not take any unsigned fonts.

Check this font name type encoding emb sub uni object ID


XXMGEM+Arial-BoldMT TrueType WinAnsi yes yes yes 15 0

This font was in first file but after combining them this font is missing.

Upvotes: 0

isapir
isapir

Reputation: 23513

The "duplication" issue seems like it's coming from multiple pages, because each page contains its own font metadata. If you iterate over the pages and get the font names, then you will see duplicates in the output if a font is used in more than one page.

Something seems very wrong with the details in the question though. Neither of the source files have ZapfDingbats font, so where did it come from into the merged document?

First, I wrote a couple of helper methods:

static String mergePdfs(InputStream is1, InputStream is2) throws IOException {
    PDFMergerUtility pdfMerger = new PDFMergerUtility();
    pdfMerger.addSource(is1);
    pdfMerger.addSource(is2);

    String destFile = System.getProperty("java.io.tmpdir") + System.nanoTime() + ".pdf";
    pdfMerger.setDestinationFileName(destFile);
    pdfMerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());

    return destFile;
}

static List<String> getFontNames(PDDocument doc) throws IOException {
    List<String> result = new ArrayList<>();
    for (int i=0; i < doc.getNumberOfPages(); i++){
        PDPage page = doc.getPage(i);
        PDResources res = page.getResources();
        for (COSName fontName : res.getFontNames()) {
            result.add(res.getFont(fontName).toString());
        }
    }

    return result;
}

Then I created 3 test PDF documents. The first 2, test-pdf-1.pdf and test-pdf-2.pdf contain one page each and use the same two fonts: PDTrueTypeFont BAAAAA+ArialMT and PDTrueTypeFont CAAAAA+Roboto-Black. The 3rd one, test-pdf-3.pdf, contains 2 pages from the first two documents, and was created with a text editor and not with PDFBox.

And then added the following test code:

Class clazz = Test.class;
String src1, src2, src3;
src1 = "/test-pdf-1.pdf";
src2 = "/test-pdf-2.pdf";
src3 = "/test-pdf-3.pdf";

InputStream is1, is2, is3;
is1 = clazz.getResourceAsStream(src1);
is2 = clazz.getResourceAsStream(src2);

String merged = mergePdfs(is1, is2);

PDDocument doc1, doc2, doc3, doc4;

is1 = clazz.getResourceAsStream(src1);
doc1 = PDDocument.load(is1);

is2 = clazz.getResourceAsStream(src2);
doc2 = PDDocument.load(is2);

is3 = clazz.getResourceAsStream(src3);
doc3 = PDDocument.load(is3);

doc4 = PDDocument.load(new File(merged));

System.out.println(src1 + " >\n\t" + getFontNames(doc1));
System.out.println(src2 + " >\n\t" + getFontNames(doc2));
System.out.println(src3 + " >\n\t" + getFontNames(doc3));
System.out.println(merged  + " >\n\t" + getFontNames(doc4));

The output is as follows (I truncated the last file name for readability and easier comparison):

/test-pdf-1.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-2.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
/test-pdf-3.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]
C:\Temp\..9.pdf >
[PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black, PDTrueTypeFont BAAAAA+ArialMT, PDTrueTypeFont CAAAAA+Roboto-Black]

You can see that both the file created by PDFBox's merge, "C:\temp\7193671804393899.pdf" (abbreviated in the output for readability), and the file "test-pdf-3.pdf" which was created with an editor have the same output for fonts, showing each font twice, one for each page.

Opening the merged file in Acrobat Reader confirms that only one copy of the fonts exists:

C:\temp\7193671804393899.pdf Properties > Fonts

Upvotes: 6

Related Questions