Nirmal
Nirmal

Reputation: 687

How to dense merge PDF files using PDFBox 2 without whitespace near page breaks?

We have been using the iText based PdfVeryDenseMergeTool we found in this SO question How To Remove Whitespace on Merge to merge multiple PDF files into a single PDF file. The tool merges PDFs without leaving any whitespace in between, and individual PDFs also get broken out across pages when possible.

We want to port PdfVeryDenseMergeTool to PDFBox. We found a PDFBox 2 based PdfDenseMergeTool that merges PDFs like this:

Individual PDFs:

Individual PDFs

Dense Merged PDF:

enter image description here

We are looking for something like this (this is already one in iText based PdfVeryDenseMergeTool but we want to do it using PDFBox 2) :

enter image description here

In our attempt to do the porting, we found that PdfVeryDenseMergeTool uses a PageVerticalAnalyzer that extends iText PDF Render Listener and does something every time a text, image, or arc is drawn in a PDF. And all the rendering info is then used to split an individual PDF across multiple pages. We tried looking for a similar PDF Render Listener in PDFBox 2 but found that the available PDFRenderer class only has image rendering methods. So we are not sure how to port PageVerticalAnalyzer to PDFBox.

If someone can suggest an approach to move forward, we'd greatly appreciate their help.

Thanks a lot!

EDIT 7 Feb 2020

At present, we are extending PDFGraphicsStreamEngine from PDFBox to make a custom rendering engine that tracks coordinates of images, text lines, and arcs when they are drawn. That custom engine will be the port of the PageVerticalAnalyzer. After that, we are hoping to be able to port PdfVeryDenseMergeTool to PDFBox.

EDIT 8 Feb 2020

Here is a very simple port of PageVerticalAnalyzer that handles images and text. I'm a PDFBox newbie, so my logic to handle images is probably wonky. Here's the basic approach:

Text: for every glyph printed, get the bottomY and make topY = bottomY + charHeight, mark those top/bottom points.

Image: for every call to drawImage(), it looks like there are two ways to figure out where it was drawn. First is using the coords from the last call to appendRectangle() and second is using the last calls to moveTo(), multiple lineTo(), and closePath(). I give the latter one priority. If I can't find any path (I found it in one PDF, in another, before drawImage(), I only found appendRectangle()), I use the former. If none of them exist, I have no clue what to do. Here's how I'm assuming PDFBox marks image coords using moveTo()/lineTo()/closePath():

enter image description here

Here is my current implementation:

import java.awt.geom.Point2D;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;


public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine
{
    /**
     * This is a port of iText based PageVerticalAnalyzer found here
     * https://github.com/mkl-public/testarea-itext5/blob/master/src/main/java/mkl/testarea/itext5/merge/PageVerticalAnalyzer.java
     *
     * @param page PDF Page
     */
    protected PageVerticalAnalyzer(PDPage page)
    {
        super(page);
    }

    public static void main(String[] args) throws IOException
    {
        File file = new File("q2.pdf");

        try (PDDocument doc = PDDocument.load(file))
        {
            PDPage page = doc.getPage(0);
            PageVerticalAnalyzer engine = new PageVerticalAnalyzer(page);
            engine.run();

            System.out.println(engine.verticalFlips);
        }
    }

    /**
     * Runs the engine on the current page.
     *
     * @throws IOException If there is an IO error while drawing the page.
     */
    public void run() throws IOException
    {
        processPage(getPage());

        for (PDAnnotation annotation : getPage().getAnnotations())
        {
            showAnnotation(annotation);
        }
    }

    // All path related stuff

    @Override
    public void clip(int windingRule) throws IOException
    {
        System.out.println("clip");
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        System.out.printf("moveTo %.2f %.2f%n", x, y);
        lastPathBottomTop = new float[] {(Float) null, y};
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        System.out.printf("lineTo %.2f %.2f%n", x, y);
        lastLineTo = new float[] {x, y};
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        System.out.printf("curveTo %.2f %.2f, %.2f %.2f, %.2f %.2f%n", x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        // if you want to build paths, you'll need to keep track of this like PageDrawer does
        return new Point2D.Float(0, 0);
    }

    @Override
    public void closePath() throws IOException
    {
        System.out.println("closePath");
        lastPathBottomTop[0] = lastLineTo[1];
        lastLineTo = null;
    }

    @Override
    public void endPath() throws IOException
    {
        System.out.println("endPath");
    }

    @Override
    public void strokePath() throws IOException
    {
        System.out.println("strokePath");
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        System.out.println("fillPath");
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        System.out.println("fillAndStrokePath");
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException
    {
        System.out.println("shadingFill " + shadingName.toString());
    }

    // Rectangle related stuff

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        System.out.printf("appendRectangle %.2f %.2f, %.2f %.2f, %.2f %.2f, %.2f %.2f%n",
                p0.getX(), p0.getY(), p1.getX(), p1.getY(),
                p2.getX(), p2.getY(), p3.getX(), p3.getY());

        lastRectBottomTop = new float[] {(float) p0.getY(), (float) p3.getY()};
    }

    // Image drawing

    @Override
    public void drawImage(PDImage pdImage) throws IOException
    {
        System.out.println("drawImage");
        if (lastPathBottomTop != null) {
            addVerticalUseSection(lastPathBottomTop[0], lastPathBottomTop[1]);  
        } else if (lastRectBottomTop != null ){
            addVerticalUseSection(lastRectBottomTop[0], lastRectBottomTop[1]);
        } else {
            throw new Error("Drawing image without last reference!");
        }

        lastPathBottomTop = null;
        lastRectBottomTop = null;

    }

    // All text related stuff

    @Override
    public void showTextString(byte[] string) throws IOException
    {
        System.out.print("showTextString \"");
        super.showTextString(string);
        System.out.println("\"");
    }

    @Override
    public void showTextStrings(COSArray array) throws IOException
    {
        System.out.print("showTextStrings \"");
        super.showTextStrings(array);
        System.out.println("\"");
    }

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                             Vector displacement) throws IOException
    {
        // print the actual character that is being rendered 
        System.out.print(unicode);

        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);

        // rendering matrix seems to contain bounding box of dimensions the char
        // and an x/y point where bounding box starts
        //System.out.println(textRenderingMatrix.toString());

        // y of the bottom of the char 
        // not sure why the y value is in the 8th column
        // when I print the matrix, it shows up in the 6th column
        float yBottom = textRenderingMatrix.getValue(0, 7);

        // height of the char
        // using the value in the first column as the char height
        float yTop =  yBottom + textRenderingMatrix.getValue(0, 0);

        addVerticalUseSection(yBottom, yTop);
    }

    // Keeping track of bottom/top point pairs
    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++)
        {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++)
            {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
    private float[] lastRectBottomTop;
    private float[] lastPathBottomTop;
    private float[] lastLineTo;

}

I am looking for answers to the following questions:

Upvotes: 1

Views: 1973

Answers (1)

mkl
mkl

Reputation: 95918

This answer suffers from the same issues as the original iText version does.

A port of the PageVerticalAnalyzer

One can port the PageVerticalAnalyzer as follows from iText to PDFBox:

public class PageVerticalAnalyzer extends PDFGraphicsStreamEngine {
    protected PageVerticalAnalyzer(PDPage page) {
        super(page);
    }

    public List<Float> getVerticalFlips() {
        return verticalFlips;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            addVerticalUseSection(rect.getMinY(), rect.getMaxY());
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        Section section = null;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                Point2D.Float point = ctm.transformPoint(x, y);
                if (section == null)
                    section = new Section(point.y);
                else
                    section.extendTo(point.y);
            }
        }
        addVerticalUseSection(section.from, section.to);
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        subPath = null;
        Section section = new Section(p0.getY());
        section.extendTo(p1.getY()).extendTo(p2.getY()).extendTo(p3.getY());
        currentPoint = p0;
    }

    @Override
    public void clip(int windingRule) throws IOException {
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        subPath = new Section(y);
        path.add(subPath);
        currentPoint = new Point2D.Float(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        if (subPath == null) {
            subPath = new Section(y);
            path.add(subPath);
        } else
            subPath.extendTo(y);
        currentPoint = new Point2D.Float(x, y);
    }

    /**
     * Beware! This is incorrect! The control points may be outside
     * the vertically used range 
     */
    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        if (subPath == null) {
            subPath = new Section(y1);
            path.add(subPath);
        } else
            subPath.extendTo(y1);
        subPath.extendTo(y2).extendTo(y3);
        currentPoint = new Point2D.Float(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return currentPoint;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        path.clear();
        subPath = null;
    }

    @Override
    public void strokePath() throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        for (Section section : path) {
            addVerticalUseSection(section.from, section.to);
        }
        path.clear();
        subPath = null;
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
        // TODO Auto-generated method stub
    }

    Point2D currentPoint = null;

    List<Section> path = new ArrayList<Section>();
    Section subPath = null;

    static class Section {
        Section(double value) {
            this((float)value);
        }

        Section(float value) {
            from = value;
            to = value;
        }

        Section extendTo(double value) {
            return extendTo((float)value);
        }

        Section extendTo(float value) {
            if (value < from)
                from = value;
            else if (value > to)
                to = value;
            return this;
        }

        private float from;
        private float to;
    }

    void addVerticalUseSection(double from, double to) {
        addVerticalUseSection((float)from, (float)to);
    }

    void addVerticalUseSection(float from, float to) {
        if (to < from) {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++) {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++) {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
}

(PageVerticalAnalyzer.java)

The implementation actually is similar to that of the BoundingBoxFinder from this answer. Just like there I borrowed from the PDFBox example DrawPrintTextLocations to determine text outlines.

Furthermore, there is an issue in the curveTo processing corresponding to that of the original iText5 PageVerticalAnalyzer from this answer, the control points are treated as if they were on the actual curve but they actually usually are not and can be far outside the vertical use range of the curve. Instead of the path processing as implemented here one could use corresponding AWT classes but that may not be possible on Android etc.

And just like there this class ignores annotations, but the iText5 dense merger also ignored annotations. And this class also ignores the clip path...

A port of the PdfVeryDenseMergeTool

public class PdfVeryDenseMergeTool {
    public PdfVeryDenseMergeTool(PDRectangle size, float top, float bottom, float gap)
    {
        this.pageSize = size;
        this.topMargin = top;
        this.bottomMargin = bottom;
        this.gap = gap;
    }

    public void merge(OutputStream outputStream, Iterable<PDDocument> inputs) throws IOException
    {
        try
        {
            openDocument();
            for (PDDocument input: inputs)
            {
                merge(input);
            }
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.save(outputStream);
        }
        finally
        {
            closeDocument();
        }
        
    }

    void openDocument() throws IOException
    {
        document = new PDDocument();
        newPage();
    }

    void closeDocument() throws IOException
    {
        try
        {
            if (currentContents != null) {
                currentContents.close();
                currentContents = null;
            }
            document.close();
        }
        finally
        {
            this.document = null;
            this.yPosition = 0;
        }
    }
    
    void newPage() throws IOException
    {
        if (currentContents != null) {
            currentContents.close();
            currentContents = null;
        }
        currentPage = new PDPage(pageSize);
        document.addPage(currentPage);
        yPosition = pageSize.getUpperRightY() - topMargin;
        currentContents = new PDPageContentStream(document, currentPage);
    }

    void merge(PDDocument input) throws IOException
    {
        for (PDPage page : input.getPages())
        {
            merge(input, page);
        }
    }

    void merge(PDDocument sourceDoc, PDPage page) throws IOException
    {
        PDRectangle pageSizeToImport = page.getCropBox();

        PageVerticalAnalyzer analyzer = new PageVerticalAnalyzer(page);
        analyzer.processPage(page);
        List<Float> verticalFlips = analyzer.getVerticalFlips();
        if (verticalFlips.size() < 2)
            return;

        LayerUtility layerUtility = new LayerUtility(document);
        PDFormXObject form = layerUtility.importPageAsForm(sourceDoc, page);

        int startFlip = verticalFlips.size() - 1;
        boolean first = true;
        while (startFlip > 0)
        {
            if (!first)
                newPage();

            float freeSpace = yPosition - pageSize.getLowerLeftY() - bottomMargin;
            int endFlip = startFlip + 1;
            while ((endFlip > 1) && (verticalFlips.get(startFlip) - verticalFlips.get(endFlip - 2) < freeSpace))
                endFlip -=2;
            if (endFlip < startFlip)
            {
                float height = verticalFlips.get(startFlip) - verticalFlips.get(endFlip);

                currentContents.saveGraphicsState();
                currentContents.addRect(0, yPosition - height, pageSizeToImport.getWidth(), height);
                currentContents.clip();
                Matrix matrix = Matrix.getTranslateInstance(0, (float)(yPosition - (verticalFlips.get(startFlip) - pageSizeToImport.getLowerLeftY())));
                currentContents.transform(matrix);
                currentContents.drawForm(form);
                currentContents.restoreGraphicsState();

                yPosition -= height + gap;
                startFlip = endFlip - 1;
            }
            else if (!first) 
                throw new IllegalArgumentException(String.format("Page %s content sections too large.", page));
            first = false;
        }
    }

    PDDocument document = null;
    PDPage currentPage = null;
    PDPageContentStream currentContents = null;
    float yPosition = 0; 

    final PDRectangle pageSize;
    final float topMargin;
    final float bottomMargin;
    final float gap;
}

(PdfVeryDenseMergeTool.java)

This essentially is a simple port of the iText 5 PdfVeryDenseMergeTool, nothing special about it.

Usage of the PdfVeryDenseMergeTool

One simply creates a PdfVeryDenseMergeTool instance with format information and then starts the merge using PDDocument instances as sources:

PDDocument document1 = ...;
...
PDDocument documentN = ...;

PdfVeryDenseMergeTool tool = new PdfVeryDenseMergeTool(PDRectangle.A4, 30, 30, 10);
tool.merge(new FileOutputStream(RESULT_FILE), Arrays.asList(document1, ..., documentN));

(DenseMerging test testVeryDenseMerging)

Upvotes: 3

Related Questions