Ozzy

Reputation: 10643

Check if file is JPEG, PDF or TIFF

How would I check that a file is either JPEG, PDF or TIFF? And I mean actually checking the contents, not just going by MIME type and file extension.

I have access to the raw file data (this check is part of an uploader) and I need to verify that the files are either JPEG, PDF or TIFF. I assume I would have to check for some sort of headers in the files, but I have no idea what to look for or where to start.

Upvotes: 7

Views: 3715

Answers (6)

K J

Reputation: 11730

There is no surefire way to be certain, but the first few bytes of a file are its signature/fingerprint, which file handlers test. See https://en.wikipedia.org/wiki/List_of_file_signatures

File types vary considerably, and some allow variable or shifting headers, so there is a degree of uncertainty (at one time PDF did not mandate that the 40-bit signature come first). Still, we can assume the following hex values, often loosely called "magic numbers", mark the start of each byte stream.

So, in general, for the requested types:

  • FF D8 (ÿØ) is JPEG in raw binary, typically /9j/4 in Base64 (JPEG 2000 is the exception: FF 4F for a raw codestream, or 00 00 for a JP2 file)
  • 25 50 44 46 2D (%PDF-) is the 40-bit signature of a PDF, or JVBER in Base64
  • 89 50 4E 47 (‰PNG) is PNG in raw binary (not requested, shown for comparison), or iVBOR in Base64

Just for good measure, here is the related, older GIF sequence:

  • 47 49 46 38 (GIF8), which is R0lGO in Base64. The first 8 bits are 01000111, for G.


Thus in all the above cases the first byte alone is enough to tell these formats apart from one another, though matching the full signature is a much stronger check. ZIP-based containers such as .docx, .pptx, .xlsx and .cbz, on the other hand, all share the same signature:

  • 50 4B 03 04 (PK followed by bytes 03 04), Base64 = UEsDB

Finally, the last requested type was TIF(F), which comes in two byte orders, Intel (little-endian) and Motorola (big-endian), so you need to test for both:

  • 49 49 2A 00 (II*\0), Base64 = SUkqA
  • 4D 4D 00 2A (MM\0*), Base64 = TU0AK
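
As a rough PHP sketch of the signature test for the three requested types: the function name `detectUploadType` and the `upload` form field in the usage comment are made up for illustration. A matching prefix only shows that the data starts like that format, not that the rest of the file is valid.

```php
<?php
/**
 * Hypothetical helper: guess the type of a file from its leading bytes.
 * Returns 'jpeg', 'pdf', 'tiff', or null when no signature matches.
 */
function detectUploadType(string $data): ?string
{
    // Signatures from the list above. For JPEG the third byte is included:
    // FF D8 is always followed by the FF that starts the next marker.
    $signatures = [
        'jpeg' => ["\xFF\xD8\xFF"],
        'pdf'  => ["%PDF-"],
        'tiff' => ["II*\x00", "MM\x00*"], // little-endian, big-endian
    ];

    foreach ($signatures as $type => $prefixes) {
        foreach ($prefixes as $prefix) {
            // strncmp() is binary safe, so NUL bytes in the prefix are fine.
            if (strncmp($data, $prefix, strlen($prefix)) === 0) {
                return $type;
            }
        }
    }
    return null;
}

// Usage (assumed field name 'upload'): read only the head of the file.
// $head = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 8);
// $type = detectUploadType($head);
```

Anything that returns null here would be rejected by the uploader; anything that matches may still deserve a full parse if the file is going to be processed further.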

Upvotes: 0

Ian Atkin

Reputation: 6356

You need to implement byte sequence tests.

Here is a guide to checking byte sequences for the most common image formats.

Upvotes: 1

dethtron5000

Reputation: 10841

exif_imagetype is very useful for this: https://www.php.net/manual/en/function.exif-imagetype.php

It scans the initial bytes of the file to determine the graphic type. It supports a large number of graphic formats (and returns false if it doesn't recognize the format).
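
A minimal sketch of how that could look for the two image types in the question. The wrapper name `isJpegOrTiff` is hypothetical, and the exif extension must be enabled. Note that PDF is not an image type, so it needs a separate check.

```php
<?php
/**
 * Hypothetical wrapper around exif_imagetype() (needs the exif extension).
 * Returns true only when the file is detected as JPEG or TIFF.
 */
function isJpegOrTiff(string $path): bool
{
    // exif_imagetype() raises a notice when it cannot read enough bytes,
    // so the @ keeps very short or unreadable uploads quiet.
    $type = @exif_imagetype($path);

    // Strict comparison: a false return never matches any constant.
    return in_array(
        $type,
        [IMAGETYPE_JPEG, IMAGETYPE_TIFF_II, IMAGETYPE_TIFF_MM],
        true
    );
}
```

IMAGETYPE_TIFF_II and IMAGETYPE_TIFF_MM correspond to the Intel and Motorola byte orders respectively.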

Upvotes: 3

x4rf41

Reputation: 5337

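To check for image types you can use the exif_imagetype function. For PDF, you have to open the file, read the first bytes, and check whether it starts with '%PDF':

// Open in binary mode so the bytes are read unmodified on every platform.
$fp = fopen($pdf, 'rb');
// fread() returns the 4 bytes requested; fgets($fp, 4) would return at most 3.
if (fread($fp, 4) === '%PDF')
{
    // ... is pdf
}
fclose($fp);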
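
If the check needs to tolerate a few bytes of junk before the header, which some PDF readers accept, a lenient variant might look like the sketch below. The function name `looksLikePdf` and the 1024-byte default window are assumptions for illustration. A strict uploader may prefer to require the header at offset 0.

```php
<?php
/**
 * Hypothetical lenient check: true if "%PDF-" occurs anywhere within the
 * first $window bytes of the file. The 1024-byte default is an assumption.
 */
function looksLikePdf(string $path, int $window = 1024): bool
{
    $fp = @fopen($path, 'rb'); // binary mode; @ hides the warning on failure
    if ($fp === false) {
        return false;
    }
    $head = fread($fp, $window);
    fclose($fp);

    // strpos() returns 0 for a match at the very start, hence the !== false.
    return $head !== false && strpos($head, '%PDF-') !== false;
}
```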

Upvotes: 0

Daniel

Reputation: 3806

This can be tricky, since it relies on each format having a "magic number" at the start of the file, which is basically a header identifying the format.

I found this wiki-page about different signatures: http://en.wikipedia.org/wiki/List_of_file_signatures

So in the best case scenario you just need to validate these first bytes.

Upvotes: 1

Ricardo Rodriguez

Reputation: 1090

If you have access to the raw file, you can check the file header for its magic number. This number identifies the type of file.

Upvotes: 0
