Reputation: 161
I am trying to split a large PDF that is a document bundle. The PDF has an index page that links to different pages, e.g.:
Index:
Topic 1: page 1-5
Topic 2: page 12-25
I am currently using PDFBox to load the PDF and get the page numbers, but I am looking for a way to get the metadata that would let me group the pages by topic.
Is there a way to retrieve the document structure so I can break the document down into smaller PDFs? For example, Topic 1 would become a single PDF containing pages 1-5.
Here is the code so far:
PDDocumentOutline outline = pdocument.getDocumentCatalog().getDocumentOutline();
for (PDOutlineItem item : outline.children()) {
    String pageTitle = item.getTitle(); // Topic 1
    PDPage destinationPage = item.findDestinationPage(pdocument);
    // How do I get the actual page number of the page?
    // How do I get the destination reference string, i.e. pg 1-5?
}
Upvotes: 0
Views: 1278
Reputation: 161
PDDocumentOutline outline = pdocument.getDocumentCatalog().getDocumentOutline();
PDPageTree pageTree = pdocument.getPages();
for (PDOutlineItem item : outline.children()) {
    // Page the bookmark points to, e.g. the first page of "Topic 1"
    PDPage currentPage = item.findDestinationPage(pdocument);
    int startPg = pageTree.indexOf(currentPage);
    // A topic ends where the next bookmark starts; the last topic runs to the end of the document
    PDOutlineItem nextItem = item.getNextSibling();
    int endPg = (nextItem != null)
            ? pageTree.indexOf(nextItem.findDestinationPage(pdocument))
            : pageTree.getCount();
    // Copy pages startPg (inclusive) to endPg (exclusive) into a new PDF named after the topic
    PDDocument document = new PDDocument();
    for (int i = startPg; i < endPg; i++) {
        document.addPage(pageTree.get(i));
    }
    document.save(targetPath + item.getTitle() + ".pdf");
    document.close();
}
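The range logic used in that loop can be looked at on its own, without PDFBox. The following is only a sketch: `TopicRanges` and `compute` are made-up names, and it assumes the bookmark titles and their zero-based start pages have already been collected in document order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopicRanges {
    /**
     * titles.get(i) starts at zero-based page index starts.get(i).
     * Returns title -> {startInclusive, endExclusive}, in document order.
     */
    static Map<String, int[]> compute(List<String> titles, List<Integer> starts, int pageCount) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        for (int i = 0; i < titles.size(); i++) {
            int start = starts.get(i);
            // End at the next bookmark's start page, or at the end of the document for the last topic
            int end = (i + 1 < starts.size()) ? starts.get(i + 1) : pageCount;
            ranges.put(titles.get(i), new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical bundle: "Topic 1" starts on page index 0, "Topic 2" on index 11, 25 pages in total
        Map<String, int[]> ranges = compute(List.of("Topic 1", "Topic 2"), List.of(0, 11), 25);
        ranges.forEach((title, r) -> System.out.println(title + ": " + r[0] + " to " + (r[1] - 1)));
    }
}
```

Note that with this approach any pages that sit between two topics (pages 6-11 in the question's example) end up in the preceding topic's PDF, because each range simply runs up to the next bookmark.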
Upvotes: 1
Reputation: 830
You may want to look at section 12.3.3 "Document Outline" in the PDF 1.7 specification. The document outline is a tree structure that provides links to various parts of the document. For example, if you convert a LibreOffice document to PDF, the headings are used for the outline.
If your PDF has such an outline, you can use it to split the document.
If it only has an index page, there may be PDF tags (see section 14.8 "Tagged PDF") available for easily getting the needed data.
If there are no PDF tags, you would probably need to parse the text and analyse it to get the needed information.
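For the text-parsing fallback, a rough sketch might look like the following. It assumes the text of the index page has already been extracted into a string (for instance with PDFBox's `PDFTextStripper`) and that entries look like the ones in the question, e.g. "Topic 1: page 1-5". The class name `IndexParser` and the regular expression are illustrative only; a real index will probably need a different pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexParser {
    // Matches lines such as "Topic 1: page 1-5"; the line format is an assumption
    private static final Pattern ENTRY =
            Pattern.compile("\\s*(.+?):\\s*pages?\\s+(\\d+)\\s*-\\s*(\\d+)\\s*");

    /** Returns title -> {firstPage, lastPage}, using the page numbers as printed in the index. */
    static Map<String, int[]> parse(String indexText) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        for (String line : indexText.split("\\R")) {
            Matcher m = ENTRY.matcher(line);
            if (m.matches()) { // lines that are not index entries, such as "Index:", are skipped
                int first = Integer.parseInt(m.group(2));
                int last = Integer.parseInt(m.group(3));
                ranges.put(m.group(1).trim(), new int[] {first, last});
            }
        }
        return ranges;
    }
}
```

The page numbers parsed this way are the printed, one-based ones, so they would need to be converted to zero-based indexes (and checked against any front matter offset) before being passed to `PDPageTree.get(int)`.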
Upvotes: 1