Reputation: 2339
I have a piece of code that extracts the images from a PDF and saves them to a folder using PDFBox. The images are not much use to me on their own because I don't know anything about where they came from. The PDF contains section headers, each followed by 1-3 pictures. Is there any way to change the program so that it tells me which section each image comes from?
Here is the code:
import java.io.IOException;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public static void main(String[] args) throws IOException {
    // Let a failed load propagate instead of continuing with a null document.
    PDDocument document = PDDocument.load("C:\\Users\\564864\\Downloads\\wsh2012.pdf");
    try {
        List<?> pages = document.getDocumentCatalog().getAllPages();
        int i = 1; // running counter used as the output file name
        for (Object o : pages) {
            PDPage page = (PDPage) o;
            PDResources resources = page.getResources();
            Map<?, ?> pageImages = (resources != null) ? resources.getImages() : null;
            if (pageImages != null) {
                for (Object value : pageImages.values()) {
                    PDXObjectImage image = (PDXObjectImage) value;
                    // write2file appends the file extension for the image type.
                    image.write2file("C:\\Users\\564864\\Desktop\\Java\\helloworld\\images\\" + i);
                    i++;
                }
            }
        }
    } finally {
        document.close(); // release the file handle even if extraction fails
    }
}
Upvotes: 0
Views: 188
Reputation: 3184
A PDF has no concept of sections unless it contains additional metadata (for example, the structure tags of a Tagged PDF); a heading is just text drawn on the page like any other. I wrote an article on extracting structured text, which applies equally to images: http://www.jpedal.org/PDFblog/2012/06/extracting-structured-text-from-pdf-files/
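Without that metadata, the usual fallback is a heuristic: work out where the headings and the images sit on each page, then give each image the nearest heading that precedes it in reading order. The following is only a minimal sketch of that last step. `SectionTagger`, `Item`, and `sectionsFor` are hypothetical names, not part of PDFBox, and the sketch assumes you have already obtained the positions some other way, with `y` measured from the top of the page (PDF's native coordinates start at the bottom-left, so they would need converting first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: labels each image with the
// nearest heading that precedes it in reading order.
final class SectionTagger {

    /** A positioned item; y is assumed to be measured from the top of the page. */
    static final class Item {
        final int page;
        final float y;
        final String label; // heading text, or an image name

        Item(int page, float y, String label) {
            this.page = page;
            this.y = y;
            this.label = label;
        }
    }

    /** True if a comes at or before b in reading order (page first, then top to bottom). */
    static boolean precedes(Item a, Item b) {
        return a.page < b.page || (a.page == b.page && a.y <= b.y);
    }

    /** For each image, returns the text of the last heading before it, or "unknown". */
    static List<String> sectionsFor(List<Item> headings, List<Item> images) {
        List<String> result = new ArrayList<>();
        for (Item image : images) {
            Item best = null;
            for (Item heading : headings) {
                // Keep the latest heading that still precedes the image.
                if (precedes(heading, image) && (best == null || precedes(best, heading))) {
                    best = heading;
                }
            }
            result.add(best == null ? "unknown" : best.label);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up positions, purely for illustration.
        List<Item> headings = Arrays.asList(
                new Item(1, 50f, "Introduction"),
                new Item(1, 400f, "Methods"),
                new Item(2, 100f, "Results"));
        List<Item> images = Arrays.asList(
                new Item(1, 200f, "img1"),
                new Item(2, 50f, "img2"),   // above the first heading on page 2
                new Item(2, 300f, "img3"));
        System.out.println(sectionsFor(headings, images));
    }
}
```

An image at the top of a page, above that page's first heading, is assigned to the last heading of an earlier page. This only works if the headings can be identified reliably in the first place (for instance by font size), which is itself a guess for an untagged PDF.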
Upvotes: 1