Reputation: 121

"black stain" when extracting page to image on PDFBox 2.0.4

When I use PDFBox 2.0.4 to render pages as images, the resulting page contains several "black holes".

This happens only for this PDF and a few others: http://www.filedropper.com/selection_3

Here is a simple program (using JavaFX) to reproduce the problem (change the file path after downloading the PDF):

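import java.awt.image.BufferedImage;
import java.io.File;

import javafx.application.Application;
import javafx.embed.swing.SwingFXUtils;
import javafx.scene.Scene;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.layout.BorderPane;
import javafx.stage.Stage;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PDFExtractionTest extends Application {

    @Override
    public void start(Stage primaryStage) throws Exception {
        BufferedImage bufferedImage;
        // try-with-resources closes the document once the page has been rendered
        try (PDDocument document = PDDocument.load(new File("C:\\Users\\John\\Desktop\\selection.pdf"))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            // Page indexes are 0-based, so this renders the second page
            bufferedImage = pdfRenderer.renderImage(1);
        }
        Image fxImage = SwingFXUtils.toFXImage(bufferedImage, null);

        // Show the rendered page in the center of the window
        BorderPane borderPane = new BorderPane();
        ImageView imageView = new ImageView(fxImage);

        borderPane.setCenter(imageView);

        primaryStage.setScene(new Scene(borderPane, 1024, 768));
        primaryStage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}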

Here are my dependencies:

pdfbox 2.0.4
jai-imageio-jpeg2000 1.3.0 (prevents the error: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed)
levigo-jbig2-imageio 1.6.5 (prevents the error: Cannot read JBIG2 image: jbig2-imageio is not installed)

The logs contain the following, but I don't know whether it is the cause of the problem. How can I fix it?

Feb 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for .notdef (9) in font Times-Bold
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
WARNING: No glyph for 9 (.notdef) in font Times-Bold
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for .notdef (9) in font Helvetica
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
WARNING: No glyph for 9 (.notdef) in font Helvetica

Did I miss something in the code, or should I report a bug?

Upvotes: 8

Answers (2)

Lonzak

Reputation: 9816

After 13 reminders I got Stian to finally release a new version 1.4.0 of the jai-imageio-jpeg2000 library.

So this issue can finally be solved by upgrading to the latest official release of the library.
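
After upgrading, it can help to confirm which JPEG2000 reader Java's ImageIO actually picks up from the classpath. The following is only a minimal sketch using the standard javax.imageio API. The class name ImageIoPluginCheck and the method readerDescriptions are hypothetical names chosen for illustration, and "JPEG2000" is assumed to be the format name the plugin registers under. The version string is whatever the plugin reports about itself and may not match the Maven artifact version, so the jar location is printed as well.

```java
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.spi.ImageReaderSpi;

// Hypothetical helper class, not part of PDFBox or jai-imageio.
final class ImageIoPluginCheck {

    // Describes every ImageIO reader registered for the given format name.
    static List<String> readerDescriptions(String formatName) {
        List<String> result = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            ImageReader reader = readers.next();
            ImageReaderSpi spi = reader.getOriginatingProvider();

            // Where the reader class was loaded from (a jar path for third-party plugins).
            CodeSource source = reader.getClass().getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "unknown location"
                    : source.getLocation().toString();

            String description = reader.getClass().getName();
            if (spi != null) {
                description += " (" + spi.getVendorName() + ", version " + spi.getVersion() + ")";
            }
            result.add(description + " from " + location);
            reader.dispose();
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the JPEG2000 plugin registers under this format name.
        List<String> readers = readerDescriptions("JPEG2000");
        if (readers.isEmpty()) {
            System.out.println("No JPEG2000 reader is registered with ImageIO");
        }
        readers.forEach(System.out::println);
    }
}
```

If the printed location still points at the 1.3.0 jar, the old version is probably still being pulled in through another dependency.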

Upvotes: 2

Tilman Hausherr

Reputation: 18906

This is a long-standing problem (see PDFBOX-1752). The bug is in JAI, not in PDFBox. The "No Unicode mapping..." warnings are irrelevant here; they only matter for text extraction.

Check out the jai-imageio-jpeg2000 project, then change the file StdEntropyDecoder.java as in this commit (expanded from this pull request). Build the project and either reference version 1.3.1-SNAPSHOT in your Maven pom.xml or copy the jar file into your classpath.

If the jai-imageio-jpeg2000 project team releases a new version that contains that pull request, then you'll no longer have to build yourself.

Additional keywords: black inkblot, black splodge

Upvotes: 3
