Reputation: 176
I am processing some large pdf files, (up to 100MB and about 2000 pages), with pdfbox. Some of the pages contain a QR code, I want to split those files into smaller ones with the pages from one QR code to the next. I got this, but the result file sizes are the same as the source file. I mean, if I cut a 100MB pdf file into a ten files I am getting ten files 100MB each.
This is the code:
PDDocument documentoPdf =
PDDocument.loadNonSeq(new File("myFile.pdf"),
new RandomAccessFile(new File("./tmp/temp"), "rw"));
int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();
int previusQR = 0;
for(int i =0; i<numPages; i++){
PDPage page = (PDPage) pages.get(i);
BufferedImage firstPageImage =
page.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);
String qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
if(qrText != null and i!=0){
PDDocument outputDocument = new PDDocument();
for(int j = previusQR; j<i; j++){
outputDocument.importPage((PDPage)pages.get(j));
}
File f = new File("./splitting_files/"+previusQR+".pdf");
outputDocument.save(f);
outputDocument.close();
documentoPdf.close();
}
I also tried the following code for storing the new file:
PDDocument outputDocument = new PDDocument();
for(int j = previusQR; j<i; j++){
PDStream src = ((PDPage)pages.get(j)).getContents();
PDStream streamD = new PDStream(outputDocument);
streamD.addCompression();
PDPage newPage = new PDPage(new
COSDictionary(((PDPage)pages.get(j)).getCOSDictionary()));
newPage.setContents(streamD);
byte[] buf = new byte[10240];
int amountRead = 0;
InputStream is = null;
OutputStream os = null;
is = src.createInputStream();
os = streamD.createOutputStream();
while((amountRead = is.read(buf,0,10240)) > -1) {
os.write(buf, 0, amountRead);
}
outputDocument.addPage(newPage);
}
File f = new File("./splitting_files/"+previusQR+".pdf");
outputDocument.save(f);
outputDocument.close();
But this code creates files which lacks some content and also have the same size than the original.
How can I create smaller pdfs files from a larger one? Is it posible with PDFBox? Is there any other library with which I can transform a single page into an image (for qr recognition), and also allows me to split a big pdf file into smaller ones?
Thx!
Upvotes: 3
Views: 4774
Reputation: 176
Thx! Tilman you are right, the PDFSplit command generates smaller files. I checked the PDFSplit code out and found that it removes the page links to avoid not needed resources.
Code extracted from Splitter.class :
private void processAnnotations(PDPage imported) throws IOException
{
List<PDAnnotation> annotations = imported.getAnnotations();
for (PDAnnotation annotation : annotations)
{
if (annotation instanceof PDAnnotationLink)
{
PDAnnotationLink link = (PDAnnotationLink)annotation;
PDDestination destination = link.getDestination();
if (destination == null && link.getAction() != null)
{
PDAction action = link.getAction();
if (action instanceof PDActionGoTo)
{
destination = ((PDActionGoTo)action).getDestination();
}
}
if (destination instanceof PDPageDestination)
{
// TODO preserve links to pages within the splitted result
((PDPageDestination) destination).setPage(null);
}
}
else
{
// TODO preserve links to pages within the splitted result
annotation.setPage(null);
}
}
}
So eventually my code looks like this:
PDDocument documentoPdf =
PDDocument.loadNonSeq(new File("docs_compuestos/50.pdf"), new RandomAccessFile(new File("./tmp/t"), "rw"));
int numPages = documentoPdf.getNumberOfPages();
List pages = documentoPdf.getDocumentCatalog().getAllPages();
int previusQR = 0;
for(int i =0; i<numPages; i++){
PDPage firstPage = (PDPage) pages.get(i);
String qrText ="";
BufferedImage firstPageImage = firstPage.convertToImage(BufferedImage.TYPE_USHORT_565_RGB , 200);
firstPage =null;
try {
qrText = readQRWithQRCodeMultiReader(firstPageImage, hintMap);
} catch (NotFoundException e) {
e.printStackTrace();
} finally {
firstPageImage = null;
}
if(i != 0 && qrText!=null){
PDDocument outputDocument = new PDDocument();
outputDocument.setDocumentInformation(documentoPdf.getDocumentInformation());
outputDocument.getDocumentCatalog().setViewerPreferences(
documentoPdf.getDocumentCatalog().getViewerPreferences());
for(int j = previusQR; j<i; j++){
PDPage importedPage = outputDocument.importPage((PDPage)pages.get(j));
importedPage.setCropBox( ((PDPage)pages.get(j)).findCropBox() );
importedPage.setMediaBox( ((PDPage)pages.get(j)).findMediaBox() );
// only the resources of the page will be copied
importedPage.setResources( ((PDPage)pages.get(j)).getResources() );
importedPage.setRotation( ((PDPage)pages.get(j)).findRotation() );
processAnnotations(importedPage);
}
File f = new File("./splitting_files/"+previusQR+".pdf");
previusQR = i;
outputDocument.save(f);
outputDocument.close();
}
}
}
Thank you very much!!
Upvotes: 3