Reputation: 325
I tried to use PDFBox on regular .pdf
files and it worked fine.
However when I encountered a corrupted .pdf
, the code would "freeze" .. not throwing errors or something .. simply the load
or parse
function take forever to execute
Here is the corrupted file (i have zipped it so that everybody could download it), it is probably not a native pdf file but it was saved as a .pdf
extension and it is only 4 Kb.
I am not an expert at all, but I think that this is a bug with PDFBox. According to documentation, both load()
and parse()
methods are supposed to throw exceptions if they fail. However in case with my file, the code would take forever to execute and not throw exception.
I tried using only load
, one can try parse()
.. the result is the same
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class TestTest {
public static void main(String[] args) throws FileNotFoundException, IOException {
System.out.println(pdfToText("C:\\..............MYFILE.pdf"));
System.out.println("done ! ! !");
}
private static String pdfToText(String fileName) throws IOException {
PDDocument document = null;
document = PDDocument.load(new File(fileName)); // THIS TAKES FOREVER
PDFTextStripper stripper = new PDFTextStripper();
document.close();
return stripper.getText(document);
}
}
How to force this code throw an exception or stop executing if the .pdf
file is corrupted?
Thanks
Upvotes: 3
Views: 4742
Reputation: 14549
For implementing simple timeouts for 3rd party libs I often use an implementation like Apache Commons ThreadMonitor:
long timeoutInMillis = 1000;
try {
Thread monitor = ThreadMonitor.start(timeoutInMillis);
// do some work here
ThreadMonitor.stop(monitor);
} catch (InterruptedException e) {
// timed amount was reached
}
Example code is from Apache's ThreadMonitor Javadoc. I only use this when the 3rd party API does not provide some timeout mechanism, of course.
However I was forced to tweak this a bit some weeks ago, because this solution does not work well with (3rd party) code that is using Exception masking.
In particular we run into problems with c3p0 which masks all Exceptions (and in particular InterruptedException
s). Our solution was to tweak the implementation to also check the exception's cause chain for InterruptedException
s.
Upvotes: 1
Reputation: 21
Try this solution:
private static String pdfToText(String fileName) {
PDDocument document = null;
try {
document = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(document);
} catch (IOException e) {
System.err.println("Unable to open PDF Parser. " + e.getMessage());
return null;
} finally {
if (document != null) {
try {
document.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
Upvotes: 2