Reputation: 35
I need to split large documents (several thousand pages, 1-2 GB) using iText 7.
I already tried splitting the PDF using this reference https://itextpdf.com/en/resources/examples/itext-7/splitting-pdf-file and also with something like this:
try (PdfDocument pdfDoc = new PdfDocument(new PdfReader(outputPdfPath.toString()))) {
    Files.createDirectories(Paths.get(destFolder));
    int numberOfPages = pdfDoc.getNumberOfPages();
    int pageNumber = 0;
    while (pageNumber < numberOfPages) {
        // pages.get(...) is indexed from 0; after the post-increment, pageNumber
        // is the 1-based page number that copyPagesTo expects.
        try (PdfDocument document = new PdfDocument(
                new PdfWriter(destFolder + pages.get(pageNumber++).id + ".pdf"))) {
            pdfDoc.copyPagesTo(pageNumber, pageNumber, document);
        }
    }
    log.info("Provided PDF has been split into multiple.");
}
Both examples work fine, but the created documents are large and contain lots of unused fonts, images, and other objects. How can I remove all these unused objects so that the newly created one-page PDFs are smaller?
Upvotes: 0
Views: 1078
Reputation: 2458
The problem with your document is as follows: each page shares a lot of (maybe even all) the fonts/XObjects of the document. While copying pages, iText doesn't know whether the resources are needed on the page or not: it just copies them, and that's why you get such huge resultant PDFs.
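To illustrate the point about shared resources, here is a toy model in plain Java. It is not iText code and does not reflect iText's internals: the class name, the method names, and the resource sizes are all made up for the illustration. It only shows why a page copied together with the whole shared resource dictionary is much larger than a page that keeps just the resources its content references.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model (hypothetical names and sizes, not the iText API).
public class SharedResourcesSketch {

    /** Naive copy: the page takes the whole shared resource dictionary with it. */
    static Map<String, Integer> copyAll(Map<String, Integer> sharedResources) {
        return new HashMap<>(sharedResources);
    }

    /** Pruned copy: keep only the resources the page content actually references. */
    static Map<String, Integer> copyUsed(Map<String, Integer> sharedResources,
                                         Set<String> usedNames) {
        Map<String, Integer> kept = new HashMap<>();
        for (String name : usedNames) {
            Integer sizeKb = sharedResources.get(name);
            if (sizeKb != null) {
                kept.put(name, sizeKb);
            }
        }
        return kept;
    }

    /** Sum of the (made-up) resource sizes in KB. */
    static int totalKb(Map<String, Integer> resources) {
        int sum = 0;
        for (int kb : resources.values()) {
            sum += kb;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Resource name -> size in KB; the values are invented for the example.
        Map<String, Integer> shared = new HashMap<>();
        shared.put("F1", 300);
        shared.put("F2", 450);
        shared.put("Im1", 4000);
        shared.put("Im2", 4500);

        // Assume page 1 only draws text with font F1.
        Set<String> usedOnPage1 = Collections.singleton("F1");

        System.out.println("naive copy:  " + totalKb(copyAll(shared)) + " KB");
        System.out.println("pruned copy: " + totalKb(copyUsed(shared, usedOnPage1)) + " KB");
    }
}
```

In this model every one-page file produced by the naive copy carries all four resources, so the overhead is paid once per page.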
The option you are looking for is iText's pdfSweep.
Its general purpose is redaction of page content; however, besides that, pdfSweep also optimizes the pages while redacting.
So how do you solve your problem?
a) Specify the redaction area as a degenerate rectangle.
b) Clean up the pages (of the split documents or of the original document):
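// Degenerate (zero-size) rectangle on page 1: nothing is actually redacted.
PdfCleanUpLocation dummyLocation = new PdfCleanUpLocation(1, new Rectangle(0, 0, 0, 0), null);
// The original snippet never defined cleanUpLocations; wrapping the dummy location is the assumed intent.
List<PdfCleanUpLocation> cleanUpLocations = Collections.singletonList(dummyLocation);
PdfDocument pdfDocument = new PdfDocument(new PdfReader(input), new PdfWriter(output));
PdfCleanUpTool cleaner = (cleanUpLocations == null)
        ? new PdfCleanUpTool(pdfDocument, true)
        : new PdfCleanUpTool(pdfDocument, cleanUpLocations);
cleaner.cleanUp();
pdfDocument.close();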
I've tried this approach to process the first of your resultant documents (which represents the first page).
The size of the document before pdfSweep processing: 9282 KB.
The size of the document after pdfSweep processing: 549 KB.
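Those two figures correspond to a reduction of roughly 94%. A minimal sketch for checking the same thing on your own files, using only the Java standard library, is shown here. The class name, method name, and command-line arguments are hypothetical and not part of iText or pdfSweep.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of iText or pdfSweep.
public class SizeReduction {

    /** Percentage by which a file shrank; sizes may be in any unit as long as both match. */
    static double reductionPercent(long before, long after) {
        if (before <= 0) {
            throw new IllegalArgumentException("before must be positive");
        }
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage (assumed): java SizeReduction before.pdf after.pdf
            long before = Files.size(Paths.get(args[0]));
            long after = Files.size(Paths.get(args[1]));
            System.out.printf("%d -> %d bytes, reduction %.1f%%%n",
                    before, after, reductionPercent(before, after));
        } else {
            // Fall back to the figures quoted in the answer (in KB).
            System.out.printf("reduction %.1f%%%n", reductionPercent(9282, 549));
        }
    }
}
```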
Upvotes: 1