Reputation:
I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to check if the provided file is indeed a valid PDF?
Upvotes: 41
Views: 80585
Reputation: 843
Here is what I use in my NUnit tests, which must validate multiple versions of PDF generated using Crystal Reports:
public static void CheckIsPDF(byte[] data)
{
Assert.IsNotNull(data);
Assert.Greater(data.Length,7); // data[7] is read below, so at least 8 bytes are needed
// header
Assert.AreEqual(data[0],0x25); // %
Assert.AreEqual(data[1],0x50); // P
Assert.AreEqual(data[2],0x44); // D
Assert.AreEqual(data[3],0x46); // F
Assert.AreEqual(data[4],0x2D); // -
if(data[5]==0x31 && data[6]==0x2E && data[7]==0x33) // version is 1.3 ?
{
// file terminator
Assert.AreEqual(data[data.Length-7],0x25); // %
Assert.AreEqual(data[data.Length-6],0x25); // %
Assert.AreEqual(data[data.Length-5],0x45); // E
Assert.AreEqual(data[data.Length-4],0x4F); // O
Assert.AreEqual(data[data.Length-3],0x46); // F
Assert.AreEqual(data[data.Length-2],0x20); // SPACE
Assert.AreEqual(data[data.Length-1],0x0A); // EOL
return;
}
if (data[5] == 0x31 && data[6] == 0x2E && (
data[7] == 0x34 // version is 1.4
|| data[7] == 0x35 // version is 1.5
|| data[7] == 0x36 // version is 1.6
)) {
// file terminator
Assert.AreEqual(data[data.Length-6],0x25); // %
Assert.AreEqual(data[data.Length-5],0x25); // %
Assert.AreEqual(data[data.Length-4],0x45); // E
Assert.AreEqual(data[data.Length-3],0x4F); // O
Assert.AreEqual(data[data.Length-2],0x46); // F
Assert.AreEqual(data[data.Length-1],0x0A); // EOL
return;
}
Assert.Fail("Unsupported file format");
}
Upvotes: 31
Reputation: 51
You can use the method below directly: pass it the bytes of the file and it returns true (valid PDF) or false.
public boolean isPdf(byte[] data) {
if (data == null || data.length < 5) return false;
// %PDF-
if (data[0] == 0x25 && data[1] == 0x50 && data[2] == 0x44 && data[3] == 0x46 && data[4] == 0x2D) {
int offset = Math.max(5, data.length - 8), count = 0; // check last 8 bytes for %%EOF with optional white-space; never start before the header
boolean hasSpace = false, hasCr = false, hasLf = false;
while (offset < data.length) {
// else-if so that a single byte can only advance the match by one step
if (count == 0 && data[offset] == 0x25) count++;
else if (count == 1 && data[offset] == 0x25) count++;
else if (count == 2 && data[offset] == 0x45) count++;
else if (count == 3 && data[offset] == 0x4F) count++;
else if (count == 4 && data[offset] == 0x46) count++;
// Optional flags for meta info
if (count == 5 && data[offset] == 0x20) hasSpace = true;
if (count == 5 && data[offset] == 0x0D) hasCr = true;
if (count == 5 && data[offset] == 0x0A) hasLf = true;
offset++;
}
if (count == 5) {
String version = data.length > 13 ? String.format("%s%s%s", (char) data[5], (char) data[6], (char) data[7]) : "?";
System.out.printf("Version : %s | Space : %b | CR : %b | LF : %b%n", version, hasSpace, hasCr, hasLf);
return true;
}
}
return false;
}
Upvotes: -1
Reputation: 3750
Relying on magic numbers does not really appeal to me. I ended up using a preflight library from Apache for this:
compile group: 'org.apache.pdfbox', name: 'preflight', version: '2.0.19'
private boolean isPdf(InputStream fileInputStream) {
try {
PreflightParser preflightParser = new PreflightParser(new ByteArrayDataSource(fileInputStream));
preflightParser.parse();
return true;
} catch (Exception e) {
return false;
}
}
PreflightParser has constructors for files and other data sources.
Upvotes: 3
Reputation: 48733
Here is a method that checks for the presence of %%EOF, with optional checks for white-space characters. You can pass in either a File or a byte[] object. There is less restriction for white-space characters in some PDF versions.
public boolean isPdf(byte[] data) {
if (data == null || data.length < 5) return false;
// %PDF-
if (data[0] == 0x25 && data[1] == 0x50 && data[2] == 0x44 && data[3] == 0x46 && data[4] == 0x2D) {
int offset = Math.max(5, data.length - 8), count = 0; // check last 8 bytes for %%EOF with optional white-space; never start before the header
boolean hasSpace = false, hasCr = false, hasLf = false;
while (offset < data.length) {
// else-if so that a single byte can only advance the match by one step
if (count == 0 && data[offset] == 0x25) count++; // %
else if (count == 1 && data[offset] == 0x25) count++; // %
else if (count == 2 && data[offset] == 0x45) count++; // E
else if (count == 3 && data[offset] == 0x4F) count++; // O
else if (count == 4 && data[offset] == 0x46) count++; // F
// Optional flags for meta info
if (count == 5 && data[offset] == 0x20) hasSpace = true; // SPACE
if (count == 5 && data[offset] == 0x0D) hasCr = true; // CR
if (count == 5 && data[offset] == 0x0A) hasLf = true; // LF / EOL
offset++;
}
if (count == 5) {
String version = data.length > 13 ? String.format("%s%s%s", (char) data[5], (char) data[6], (char) data[7]) : "?";
System.out.printf("Version : %s | Space : %b | CR : %b | LF : %b%n", version, hasSpace, hasCr, hasLf);
return true;
}
}
return false;
}
public boolean isPdf(File file) throws IOException {
return isPdf(file, false);
}
// With version: 16 bytes, without version: 13 bytes.
public boolean isPdf(File file, boolean includeVersion) throws IOException {
if (file == null) return false;
int offsetStart = includeVersion ? 8 : 5, offsetEnd = 8;
byte[] bytes = new byte[offsetStart + offsetEnd];
InputStream is = new FileInputStream(file);
try {
is.read(bytes, 0, offsetStart); // %PDF-
is.skip(file.length() - bytes.length); // Skip bytes
is.read(bytes, offsetStart, offsetEnd); // %%EOF,SP?,CR?,LF?
} finally {
is.close();
}
return isPdf(bytes);
}
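One caveat with the File variant above: InputStream.read and InputStream.skip are allowed to process fewer bytes than requested, so the buffer may be only partly filled. A minimal sketch of the same head-and-tail idea using java.io.RandomAccessFile, whose readFully and seek do not have that problem, could look like this. The names PdfFileSniffer, looksLikePdf and TAIL_BYTES are made up for illustration, and the 8-byte tail simply mirrors the answer above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class PdfFileSniffer {
    // Assumption: same 8-byte tail window as the answer above.
    private static final int TAIL_BYTES = 8;

    /** Returns true if the file starts with "%PDF-" and has "%%EOF" in its last 8 bytes. */
    static boolean looksLikePdf(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long length = raf.length();
            if (length < 10) return false; // too short for "%PDF-" plus "%%EOF"

            byte[] head = new byte[5];
            raf.readFully(head); // fills the whole buffer or throws EOFException
            if (!"%PDF-".equals(new String(head, StandardCharsets.ISO_8859_1))) {
                return false;
            }

            byte[] tail = new byte[TAIL_BYTES];
            raf.seek(length - TAIL_BYTES); // jump straight to the tail
            raf.readFully(tail);
            // ISO-8859-1 maps each byte to one char, so binary data cannot break decoding.
            return new String(tail, StandardCharsets.ISO_8859_1).contains("%%EOF");
        }
    }
}
```

Calling PdfFileSniffer.looksLikePdf(new File("some.pdf")) only touches 13 bytes of the file, so it stays cheap for large documents. Like every marker check in this thread, it says nothing about whether the content in between is intact.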
Upvotes: 1
Reputation: 17
In general, every PDF version ends with %%EOF, so we can check for it as shown below.
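public static boolean is_pdf(byte[] data) {
    // Requires: import java.nio.charset.StandardCharsets;
    // Check null and length first; 7 bytes are needed for the tail read below.
    if (data != null && data.length > 6 &&
            data[0] == 0x25 && // %
            data[1] == 0x50 && // P
            data[2] == 0x44 && // D
            data[3] == 0x46 && // F
            data[4] == 0x2D) { // -
        // Decode only the last 7 bytes; ISO-8859-1 maps every byte to exactly one char.
        String tail = new String(data, data.length - 7, 7, StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }
    return false;
}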
Upvotes: 0
Reputation: 2454
The answer by Roger Keays is wrong, since not all PDF files are version 1.3 or 1.4 and not all of them are terminated by an EOL. The version below works for all non-corrupted PDF files:
public static boolean is_pdf(byte[] data) {
if (data != null && data.length > 6 // the terminator checks below read data[data.length - 7]
&& data[0] == 0x25 && // %
data[1] == 0x50 && // P
data[2] == 0x44 && // D
data[3] == 0x46 && // F
data[4] == 0x2D) { // -
// version 1.3 file terminator
if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
data[data.length - 7] == 0x25 && // %
data[data.length - 6] == 0x25 && // %
data[data.length - 5] == 0x45 && // E
data[data.length - 4] == 0x4F && // O
data[data.length - 3] == 0x46 && // F
data[data.length - 2] == 0x20 // SPACE
//&& data[data.length - 1] == 0x0A// EOL
) {
return true;
}
// version 1.4 file terminator
if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
data[data.length - 6] == 0x25 && // %
data[data.length - 5] == 0x25 && // %
data[data.length - 4] == 0x45 && // E
data[data.length - 3] == 0x4F && // O
data[data.length - 2] == 0x46 // F
//&& data[data.length - 1] == 0x0A // EOL
) {
return true;
}
}
return false;
}
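Since the exact bytes after %%EOF vary (nothing, LF, CR LF, a space and LF), another option is to stop matching fixed offsets and search for the marker near the end of the data. The sketch below does that. The 1024-byte window is an assumption on my part: as far as I recall, Adobe's implementation notes say Acrobat accepts %%EOF anywhere in the last 1024 bytes, but treat that number as a tunable guess. PdfMarkers, hasPdfMarkers and indexOf are hypothetical names.

```java
final class PdfMarkers {
    private static final byte[] HEADER = {0x25, 0x50, 0x44, 0x46, 0x2D};     // %PDF-
    private static final byte[] EOF_MARKER = {0x25, 0x25, 0x45, 0x4F, 0x46}; // %%EOF
    private static final int WINDOW = 1024; // assumed search window, see note above

    /** True if data starts with "%PDF-" and contains "%%EOF" within the last WINDOW bytes. */
    static boolean hasPdfMarkers(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        if (indexOf(data, HEADER, 0, HEADER.length) != 0) {
            return false;
        }
        // Never search inside the header itself.
        int from = Math.max(HEADER.length, data.length - WINDOW);
        return indexOf(data, EOF_MARKER, from, data.length) >= 0;
    }

    /** Index of the first occurrence of pattern lying fully inside data[from, to), or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from, int to) {
        for (int i = from; i + pattern.length <= to; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }
}
```

This accepts a file ending in a bare %%EOF as well as one ending in "%%EOF \n", without a separate branch per variant. It is still only a marker check, so a file with a damaged body passes it.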
Upvotes: 3
Reputation: 193
I was using some of the suggestions I found here and on other sites/posts for determining whether a pdf was valid or not. I purposely corrupted a pdf file, and unfortunately, many of the solutions did not detect that the file was corrupted.
Eventually, after tinkering around with different methods in the API, I tried this:
PDDocument.load(file).getPage(0).getContents().toString();
This did not throw an exception, but it did output this:
WARN [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015
Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but it appeared that the API already handled such errors in its own way.
To get around this, I decided to try parsing the files using the class that emitted the warning (COSParser). I found that there was a subclass, called PDFParser, which inherited a method called "setLenient", which was the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).
I then implemented the following:
RandomAccessFile accessFile = new RandomAccessFile(file, "r"); // org.apache.pdfbox.io.RandomAccessFile, not java.io.RandomAccessFile
PDFParser parser = new PDFParser(accessFile);
parser.setLenient(false);
parser.parse();
This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!
Upvotes: 8
Reputation: 143
Maybe I am too late to answer, but you should have a look at Tika. It uses the PDFBox parser internally to parse PDFs.
You just need to import tika-app-latest*.jar
public String parseToStringExample() throws IOException, SAXException, TikaException
{
Tika tika = new Tika();
try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
return tika.parseToString(stream); // This should return you the pdf's text
}
}
It would be a much cleaner solution. You can refer here for more details on Tika usage: https://tika.apache.org/1.12/api/
Upvotes: 4
Reputation: 3247
Here is an adapted Java version of NinjaCross's code.
/**
* Test if the data in the given byte array represents a PDF file.
*/
public static boolean is_pdf(byte[] data) {
if (data != null && data.length > 7 && // data[5..7] and data[data.length - 7] are read below
data[0] == 0x25 && // %
data[1] == 0x50 && // P
data[2] == 0x44 && // D
data[3] == 0x46 && // F
data[4] == 0x2D) { // -
// version 1.3 file terminator
if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
data[data.length - 7] == 0x25 && // %
data[data.length - 6] == 0x25 && // %
data[data.length - 5] == 0x45 && // E
data[data.length - 4] == 0x4F && // O
data[data.length - 3] == 0x46 && // F
data[data.length - 2] == 0x20 && // SPACE
data[data.length - 1] == 0x0A) { // EOL
return true;
}
// version 1.4 file terminator
if (data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
data[data.length - 6] == 0x25 && // %
data[data.length - 5] == 0x25 && // %
data[data.length - 4] == 0x45 && // E
data[data.length - 3] == 0x4F && // O
data[data.length - 2] == 0x46 && // F
data[data.length - 1] == 0x0A) { // EOL
return true;
}
}
return false;
}
And some simple unit tests:
@Test
public void test_valid_pdf_1_3_data_is_pdf() {
assertTrue(is_pdf("%PDF-1.3 CONTENT %%EOF \n".getBytes()));
}
@Test
public void test_valid_pdf_1_4_data_is_pdf() {
assertTrue(is_pdf("%PDF-1.4 CONTENT %%EOF\n".getBytes()));
}
@Test
public void test_invalid_data_is_not_pdf() {
assertFalse(is_pdf("Hello World".getBytes()));
}
If you come up with any failing unit tests, please let me know.
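One input that fails the checks above is any header other than 1.3 or 1.4: a 1.5, 1.7 or 2.0 file is rejected because the version bytes are compared against fixed values. A small sketch that reads the version generically instead could look like this. PdfVersion and of are hypothetical names, and the "digit, dot, digit" shape of the version is an assumption based on the headers shown in this thread.

```java
final class PdfVersion {
    /**
     * Returns the version text following "%PDF-" (for example "1.4"),
     * or null if the header is missing or not of the form digit '.' digit.
     */
    static String of(byte[] data) {
        if (data == null || data.length < 8) {
            return null;
        }
        if (data[0] != '%' || data[1] != 'P' || data[2] != 'D'
                || data[3] != 'F' || data[4] != '-') {
            return null;
        }
        if (!isDigit(data[5]) || data[6] != '.' || !isDigit(data[7])) {
            return null;
        }
        return "" + (char) data[5] + '.' + (char) data[7];
    }

    private static boolean isDigit(byte b) {
        return b >= '0' && b <= '9';
    }
}
```

With this, the terminator check can be applied to any version rather than being gated on two hard-coded ones, and the caller can still log or branch on the returned string if it needs to.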
Upvotes: 9
Reputation: 510
There is a very convenient and simple library for testing PDF content: https://github.com/codeborne/pdf-test
The API is very simple:
import java.io.File;
import org.junit.Test;
import com.codeborne.pdftest.PDF;
import static com.codeborne.pdftest.PDF.*;
import static org.junit.Assert.assertThat;
public class PDFContainsTextTest {
@Test
public void canAssertThatPdfContainsText() {
PDF pdf = new PDF(new File("src/test/resources/50quickideas.pdf"));
assertThat(pdf, containsText("50 Quick Ideas to Improve your User Stories"));
}
}
Upvotes: 1
Reputation: 1030
You could try this:
public boolean isPDF(File file) throws FileNotFoundException {
    // Use the file that was passed in; try-with-resources closes the Scanner and its reader.
    try (Scanner input = new Scanner(new FileReader(file))) {
        while (input.hasNextLine()) {
            final String checkline = input.nextLine();
            if (checkline.contains("%PDF-")) {
                // a match!
                return true;
            }
        }
    }
    return false;
}
Upvotes: 5
Reputation: 31928
Since you use PDFBox you can simply do:
PDDocument.load(file);
It'll fail with an Exception if the PDF is corrupted etc.
If it succeeds you can also check if the PDF is encrypted using .isEncrypted()
Upvotes: 13
Reputation: 15789
You can find out the MIME type of a file (or byte array), so you don't blindly rely on the extension. I do it with Aperture's MimeExtractor (http://aperture.sourceforge.net/); a few days ago I also saw a library just for that (http://sourceforge.net/projects/mime-util).
I use Aperture to extract text from a variety of files, not only PDF, but I had to tweak things for PDFs (Aperture uses PDFBox, but I added another library as a fallback for when PDFBox fails).
Upvotes: 12
Reputation: 30888
PDF files begin with "%PDF" (open one in TextPad or similar and take a look).
Any reason you can't just read the file with a StringReader and check for this?
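A minimal sketch of that header check, assuming the data arrives as an InputStream. It reads raw bytes rather than going through a character reader, since a PDF is binary, and it checks for "%PDF-" (with the dash) as the other answers do. PdfHeaderCheck and startsWithPdfHeader are made-up names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class PdfHeaderCheck {
    /** True if the first five bytes of the stream are "%PDF-". Consumes up to five bytes. */
    static boolean startsWithPdfHeader(InputStream in) throws IOException {
        byte[] head = new byte[5];
        int read = 0;
        // read() may return fewer bytes than asked for, so loop until the buffer is full.
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                return false; // stream ended before five bytes were available
            }
            read += n;
        }
        return "%PDF-".equals(new String(head, StandardCharsets.US_ASCII));
    }
}
```

This only filters out files that are obviously not PDFs (wrong upload, renamed image and so on). A truncated or corrupted PDF still has a valid header, so a parser-based check is needed to catch those.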
Upvotes: 4