Reputation: 121
I am using PDFBox 2.0.4 to render PDF pages as images, and the rendered page contains several "black holes", as shown in the following screenshot:
This happens only with this PDF and a few others: http://www.filedropper.com/selection_3
Here is a simple program (using JavaFX) that reproduces the problem (change the file path after downloading the PDF):
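import java.awt.image.BufferedImage;
import java.io.File;
import javafx.application.Application;
import javafx.embed.swing.SwingFXUtils;
import javafx.scene.Scene;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.layout.BorderPane;
import javafx.stage.Stage;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PDFExtractionTest extends Application {
    @Override
    public void start(Stage primaryStage) throws Exception {
        BufferedImage bufferedImage;
        // try-with-resources closes the document once the page has been rendered.
        try (PDDocument document = PDDocument.load(new File("C:\\Users\\John\\Desktop\\selection.pdf"))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            // Page indexes are zero-based, so index 1 renders the second page.
            bufferedImage = pdfRenderer.renderImage(1);
        }
        // Convert the AWT image to a JavaFX image and show it in a window.
        Image fxImage = SwingFXUtils.toFXImage(bufferedImage, null);
        BorderPane borderPane = new BorderPane();
        ImageView imageView = new ImageView(fxImage);
        borderPane.setCenter(imageView);
        primaryStage.setScene(new Scene(borderPane, 1024, 768));
        primaryStage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}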
Here are my dependencies:
The logs contain the following warnings, but I don't know whether they are the cause of the problem. How can I fix it?
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for .notdef (9) in font Times-Bold
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
WARNING: No glyph for 9 (.notdef) in font Times-Bold
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for .notdef (9) in font Helvetica
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
WARNING: No glyph for 9 (.notdef) in font Helvetica
Did I miss something in the code, or should I report a bug?
Upvotes: 8
Views: 1633
Reputation: 9816
After 13 reminders, I finally got Stian to release a new version, 1.4.0, of the jai-imageio-jpeg2000 library.
So this problem can finally be solved by upgrading to the latest official release of that library.
Upvotes: 2
Reputation: 18906
This is a long-standing problem (see PDFBOX-1752). The bug is in JAI, not in PDFBox. The "No Unicode mapping..." warnings are irrelevant here; they only matter for text extraction.
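Since those warnings do not affect rendering, they can be hidden if they clutter the output. The following is only a sketch, and it assumes that PDFBox's logging ends up in java.util.logging, which the format of the log lines in the question suggests. The class name QuietPdfBoxWarnings is made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPdfBoxWarnings {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Raises the threshold so that WARNING messages below org.apache.pdfbox are dropped. */
    public static Logger silenceWarnings() {
        // Assumption: PDFBox logs through java.util.logging in this setup.
        PDFBOX_LOGGER.setLevel(Level.SEVERE);
        return PDFBOX_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = silenceWarnings();
        System.out.println(logger.getName() + " level: " + logger.getLevel());
    }
}
```

Call silenceWarnings() once before loading the document. Child loggers such as org.apache.pdfbox.pdmodel.font.PDSimpleFont inherit the level as long as they have no level of their own. This hides the messages only and does nothing about the black areas.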
Check out the jai-imageio-jpeg2000 project, then change the file StdEntropyDecoder.java as in this commit (expanded from this pull request). Build the project, then either reference version 1.3.1-SNAPSHOT in your Maven pom.xml or copy the jar file onto your classpath.
If the jai-imageio-jpeg2000 project team releases a new version that contains that pull request, you will no longer have to build it yourself.
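To confirm which JPEG 2000 decoder is actually picked up at runtime (for example, your self-built jar rather than an older copy elsewhere on the classpath), you can ask ImageIO which readers are registered. This is a rough sketch using only the standard javax.imageio API. The class name Jpeg2000ReaderCheck is made up for illustration, and the format name "JPEG2000" is an assumption about what the plugin registers.

```java
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.spi.ImageReaderSpi;

public class Jpeg2000ReaderCheck {
    /** Describes every ImageIO reader registered for the given format name. */
    public static List<String> describeReaders(String formatName) {
        List<String> result = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            ImageReader reader = readers.next();
            ImageReaderSpi spi = reader.getOriginatingProvider();
            String version = (spi == null) ? "unknown" : spi.getVersion();
            // The code source tells us which jar the reader class was loaded from.
            CodeSource source = reader.getClass().getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "JDK built-in or unknown"
                    : source.getLocation().toString();
            result.add(reader.getClass().getName() + " | version " + version + " | " + location);
            reader.dispose();
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the JPEG 2000 plugin registers itself under this format name.
        List<String> readers = describeReaders("JPEG2000");
        if (readers.isEmpty()) {
            System.out.println("No JPEG2000 reader found on the classpath");
        }
        for (String line : readers) {
            System.out.println(line);
        }
    }
}
```

If the printed location points at an old jar, the patched build is not the one being used for rendering.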
Additional keywords: black inkblot, black splodge
Upvotes: 3