Does a PDPage object contain a reference to the PDDocument to which it belongs?
In other words, does a PDPage have knowledge of its PDDocument?
Somewhere in the application I have a list of PDDocuments.
These documents get merged into one new PDDocument:
PDFMergerUtility pdfMerger = new PDFMergerUtility();
PDDocument mergedPDDocument = new PDDocument();
for (PDDocument pdfDocument : documentList) {
    pdfMerger.appendDocument(mergedPDDocument, pdfDocument);
}
Then this PDDocument gets split into bundles of 10:
Splitter splitter = new Splitter();
splitter.setSplitAtPage(bundleSize);
List<PDDocument> bundleList = splitter.split(mergedPDDocument);
My question is:
If I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?
Also, if you have a PDPage object, can you get information from it such as its page number? Or can you get this another way?
Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument?
Unfortunately, a PDPage does not contain a reference to its parent PDDocument. It does, however, reference its parent node in the page tree, and that node lists the other pages, so you can navigate between pages without a reference to the parent PDDocument.
If you have a PDPage object, can you get information from it like its page number, or can you get this via another way?
There is a workaround to get the position of a PDPage in the document without the PDDocument available. Each PDPage has a dictionary with information about the size of the page, resources, fonts, content, etc. One of these attributes is called Parent. It refers to the page tree node (type Pages) whose Kids array holds the page dictionaries, and each of those has everything needed to create a shallow clone of the PDPage using the constructor PDPage(COSDictionary). The kids are in the correct order, so the page number can be obtained from the position of the page in the array.
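The position-in-array approach works when every page hangs directly off a single Pages node. PDF page trees may also be nested, with Kids holding intermediate Pages nodes. Here is a minimal sketch of how a page index could be computed in that case. It uses a hypothetical PageNode class as a stand-in for the COS dictionaries (it is not PDFBox API): walk up through the parents and count the leaf pages under every preceding sibling.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a page tree node; not a PDFBox class. */
final class PageNode {
    final PageNode parent;                          // null for the root node
    final boolean isPage;                           // true for a leaf page, false for a Pages node
    final List<PageNode> kids = new ArrayList<>();  // children in document order

    PageNode(PageNode parent, boolean isPage) {
        this.parent = parent;
        this.isPage = isPage;
        if (parent != null) {
            parent.kids.add(this);
        }
    }

    /** Number of leaf pages at or below this node. */
    int leafCount() {
        if (isPage) {
            return 1;
        }
        int count = 0;
        for (PageNode kid : kids) {
            count += kid.leafCount();
        }
        return count;
    }

    /** Zero-based index of this page within the whole tree. */
    int pageIndex() {
        int index = 0;
        PageNode child = this;
        for (PageNode node = parent; node != null; node = node.parent) {
            // Add the pages under every sibling that comes before 'child'.
            for (PageNode sibling : node.kids) {
                if (sibling == child) {
                    break;
                }
                index += sibling.leafCount();
            }
            child = node;
        }
        return index;
    }
}

final class PageTreeDemo {
    public static void main(String[] args) {
        PageNode root = new PageNode(null, false);
        PageNode p0 = new PageNode(root, true);
        PageNode inner = new PageNode(root, false);  // nested Pages node
        PageNode p1 = new PageNode(inner, true);
        PageNode p2 = new PageNode(inner, true);
        PageNode p3 = new PageNode(root, true);
        System.out.println(p2.pageIndex());  // 2
        System.out.println(p3.pageIndex());  // 3
    }
}
```

With real PDFBox objects the same walk would follow the Parent and Kids entries of the page dictionary; that mapping is left out here because it depends on the PDFBox version in use.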
If I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?
Once you merge the document list into a single document, all references to the original documents are lost. You can confirm this by looking at the Parent object inside the PDPage: go to Parent > Kids > COSObject[n] > Parent and check whether the number for Parent is the same for all the elements in the array. In this example Parent is COSName {Parent} : 1781256139; for all pages.
COSName {Parent} : COSObject {
  COSDictionary {
    COSName {Kids} : COSArray {
      COSObject {
        COSDictionary {
          COSName {TrimBox} : COSArray {0; 0; 612; 792;};
          COSName {MediaBox} : COSArray {0; 0; 612; 792;};
          COSName {CropBox} : COSArray {0; 0; 612; 792;};
          COSName {Resources} : COSDictionary {
            ...
          };
          COSName {Contents} : COSObject {
            ...
          };
          COSName {Parent} : 1781256139;
          COSName {StructParents} : COSInt {68};
          COSName {ArtBox} : COSArray {0; 0; 612; 792; };
          COSName {BleedBox} : COSArray {0; 0; 612; 792; };
          COSName {Type} : COSName {Page};
        }
      }
      ...
    };
    COSName {Count} : COSInt {4};
    COSName {Type} : COSName {Pages};
  }
};
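Because the merged document keeps no link back to its sources, one possible workaround is to record each document's page count before merging (for example from PDDocument.getNumberOfPages()) and later map a page's position in a bundle back to its source. The sketch below is plain Java with hypothetical class and method names. It assumes the merge preserves page order and that every bundle except the last holds exactly bundleSize pages.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not part of PDFBox. */
final class PageOrigin {

    /**
     * @param pageCounts   page count of each source document, in merge order
     * @param bundleSize   number of pages per bundle used when splitting
     * @param bundleIndex  zero-based index of the bundle in the split result
     * @param pageInBundle zero-based index of the page inside that bundle
     * @return zero-based index of the source document the page came from
     */
    static int sourceDocumentOf(List<Integer> pageCounts, int bundleSize,
                                int bundleIndex, int pageInBundle) {
        // Position of the page in the merged document.
        int globalIndex = bundleIndex * bundleSize + pageInBundle;
        int firstPageOfNextDoc = 0;
        for (int doc = 0; doc < pageCounts.size(); doc++) {
            firstPageOfNextDoc += pageCounts.get(doc);
            if (globalIndex < firstPageOfNextDoc) {
                return doc;
            }
        }
        throw new IndexOutOfBoundsException(
                "page " + globalIndex + " is beyond the merged document");
    }

    public static void main(String[] args) {
        // Three source documents with 4, 7 and 12 pages, split into bundles of 10.
        List<Integer> pageCounts = Arrays.asList(4, 7, 12);
        System.out.println(sourceDocumentOf(pageCounts, 10, 0, 3));  // 0
        System.out.println(sourceDocumentOf(pageCounts, 10, 1, 0));  // 1
        System.out.println(sourceDocumentOf(pageCounts, 10, 1, 1));  // 2
    }
}
```

The page counts have to be collected in the same loop that calls appendDocument, since that is the last point at which the original documents are still distinguishable.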
Source code
I wrote the following code to show how the information from the PDPage dictionary can be used to navigate the pages backward and forward and get the page number using the position in the array.
import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class PDPageUtils {

    public static void main(String[] args) throws InvalidPasswordException, IOException {
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        PDDocument document = null;
        try {
            String filename = "src/main/resources/pdf/us-017.pdf";
            document = PDDocument.load(new File(filename));
            System.out.println("listIterator(PDPage)");
            ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
            while (pageIterator.hasNext()) {
                PDPage page = pageIterator.next();
                // After next(), previousIndex() is the index of the page just returned.
                System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * Returns a <code>ListIterator</code> initialized with the list of pages from
     * the dictionary embedded in the specified <code>PDPage</code>. The current
     * position of this <code>ListIterator</code> is set to the position of the
     * specified <code>PDPage</code>.
     *
     * @param page the specified <code>PDPage</code>
     * @return an iterator over the sibling pages, positioned at <code>page</code>
     *
     * @see java.util.ListIterator
     * @see org.apache.pdfbox.pdmodel.PDPage
     */
    public static ListIterator<PDPage> listIterator(PDPage page) {
        List<PDPage> pages = new LinkedList<PDPage>();
        COSDictionary pageDictionary = page.getCOSObject();
        // Parent is the page tree node; its Kids array lists the pages in order.
        COSDictionary parentDictionary = pageDictionary.getCOSDictionary(COSName.PARENT);
        COSArray kidsArray = parentDictionary.getCOSArray(COSName.KIDS);
        List<? extends COSBase> kidList = kidsArray.toList();
        for (COSBase kid : kidList) {
            if (kid instanceof COSObject) {
                COSObject kidObject = (COSObject) kid;
                COSBase type = kidObject.getDictionaryObject(COSName.TYPE);
                // Only direct Page kids are collected; nested Pages nodes are skipped.
                if (type == COSName.PAGE) {
                    COSBase kidPageBase = kidObject.getObject();
                    if (kidPageBase instanceof COSDictionary) {
                        COSDictionary kidPageDictionary = (COSDictionary) kidPageBase;
                        pages.add(new PDPage(kidPageDictionary));
                    }
                }
            }
        }
        int index = pages.indexOf(page);
        return pages.listIterator(index);
    }
}
Sample output
In this example the PDF document has 4 pages and the iterator was initialized with the first page. Notice that the page number is the previousIndex().
System.out.println("listIterator(PDPage)");
ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
while (pageIterator.hasNext()) {
    PDPage page = pageIterator.next();
    System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 0, Structural Parent Key: 68
page #: 1, Structural Parent Key: 69
page #: 2, Structural Parent Key: 70
page #: 3, Structural Parent Key: 71
You can also navigate backwards by starting from the last page. Notice now that the page number is the nextIndex().
ListIterator<PDPage> pageIterator = listIterator(document.getPage(3));
pageIterator.next();
while (pageIterator.hasPrevious()) {
    PDPage page = pageIterator.previous();
    System.out.println("page #: " + pageIterator.nextIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 3, Structural Parent Key: 71
page #: 2, Structural Parent Key: 70
page #: 1, Structural Parent Key: 69
page #: 0, Structural Parent Key: 68
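The index bookkeeping in both loops comes from java.util.ListIterator itself rather than from PDFBox. The cursor sits between elements, so after next() the element just returned is at previousIndex(), and after previous() it is at nextIndex(). A small stand-alone check, using strings in place of pages (the class and method names are made up for this illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

final class CursorDemo {

    /** Index of the element returned by next() when starting at 'start'. */
    static int indexAfterNext(List<String> list, int start) {
        ListIterator<String> it = list.listIterator(start);
        it.next();
        return it.previousIndex();
    }

    /** Index of the element returned by previous() when starting at 'start'. */
    static int indexAfterPrevious(List<String> list, int start) {
        ListIterator<String> it = list.listIterator(start);
        it.previous();
        return it.nextIndex();
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("p0", "p1", "p2", "p3");
        System.out.println(indexAfterNext(pages, 0));      // 0
        System.out.println(indexAfterPrevious(pages, 4));  // 3
    }
}
```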