CJ7

Reputation: 23275

Determine whether a file is a PDF in Perl?

Using Perl, what is the best way to determine whether a file is a PDF?

Apparently, not all PDFs start with %PDF. See the comments on this answer: https://stackoverflow.com/a/941962/327528

Upvotes: 2

Views: 816

Answers (2)

Patrick Gallot

Reputation: 625

Detecting a PDF is not hard, but there are some corner cases to be aware of.

  1. All conforming PDFs contain a one-line header identifying the PDF specification to which the file conforms. Usually it's %PDF-1.N where N is a digit between 0 and 7.
    • The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the header appear within the first 1024 bytes of a file. (I've seen some cases where a job control prefix was added to the start of a PDF file, so '%PDF-1.' wasn't the first seven bytes of the file.)
    • The subsequent implementation note from the third edition (PDF 1.4) states that Acrobat viewers will also accept a header of the form %!PS-Adobe-N.n PDF-M.m, but note that this isn't part of the ISO 32000-1:2008 (PDF 1.7) specification.
    • If the file doesn't begin immediately with %PDF-1.N, be careful: I've seen a case where a zip file containing a PDF was mistakenly identified as a PDF because that part of the embedded file wasn't compressed. So a check for the PDF file trailer is a good idea.
  2. The end of a PDF will contain a line with '%%EOF'.
    • The third edition of the PDF Reference has an implementation note that Acrobat viewers require only that the %%EOF marker appear within the last 1024 bytes of a file.
    • Two lines above the %%EOF should be the 'startxref' token, and the line in between should be a number giving the byte offset from the start of the file to the last cross-reference table.

In sum, read the first and last 1 KB of the file into byte buffers and check that the identifying byte strings are approximately where they are supposed to be. If they are, you can reasonably expect that you have a PDF file on your hands.
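
The checks above can be sketched roughly as follows, using only core Perl. This is a heuristic under stated assumptions, not a validator: looks_like_pdf is a hypothetical helper name, the 1024-byte window comes from the Acrobat implementation notes, and the header pattern is slightly broader than %PDF-1.N in that it accepts any single-digit major and minor version.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the file looks like a PDF, 0 otherwise.
# It only inspects the first and last 1024 bytes; it does not parse the file.
sub looks_like_pdf {
    my ($path) = @_;
    my $window = 1024;

    open(my $fh, '<:raw', $path) or return 0;
    my $size = -s $fh;
    $size = 0 unless defined $size;

    # First 1 KB: the header should appear somewhere in here.
    my $head = '';
    unless (defined read($fh, $head, $window)) {
        close($fh);
        return 0;
    }

    # Last 1 KB: the trailer (startxref, offset, %%EOF) should be here.
    my $tail  = '';
    my $start = $size > $window ? $size - $window : 0;
    seek($fh, $start, 0);
    unless (defined read($fh, $tail, $window)) {
        close($fh);
        return 0;
    }
    close($fh);

    # Accept the usual header anywhere in the window, or the
    # "%!PS-Adobe-N.n PDF-M.m" form that Acrobat viewers also tolerate.
    my $has_header = $head =~ /%PDF-\d\.\d/
                  || $head =~ /%!PS-Adobe-\d+\.\d+ PDF-\d\.\d/;
    return 0 unless $has_header;

    # 'startxref', then a byte offset, then '%%EOF', each on its own line.
    return $tail =~ /startxref\s+\d+\s+%%EOF/ ? 1 : 0;
}
```

Requiring both the header and the trailer is what guards against the zip-file false positive described above. A stricter variant could additionally insist that the header sit at byte 0.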

Upvotes: 1

Joel

Reputation: 2035

The module PDF::Parse has a method called IsaPDF, which:

Returns true, if the file could be parsed and is a PDF-file.

Upvotes: 0
