Reputation: 42033
I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.
In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.
How can I determine where in the file the tag is located?
The images that I am scanning are in the format TIFF, JPG, PNG, BMP, DNG, CR2, NEF, and some videos MOV, AVI, and MPG.
Upvotes: 29
Views: 11167
Reputation: 5771
Starting with version 12.58 (Mar. 15, 2023), exiftool can generate an MD5, SHA256, or SHA512 hash of the image data that ignores the embedded metadata.
The hash can be generated for JPEG, TIFF, PNG, CRW, CR3, MRW, RAF, X3F, IIQ, JP2, JXL, HEIC and AVIF images, MOV/MP4 videos, and some RIFF-based files such as AVI, WAV and WEBP.
The tag name is ImageDataHash (originally called ImageDataMD5) and the hash algorithm can be changed with the -API ImageHashType option. To avoid performance issues, it is only generated if requested on the command line.
Exiftool also provides tags for storing the hash value and type in the file: XMP-et:OriginalImageHash and XMP-et:OriginalImageHashType.
Example usage:
C:\>exiftool -G1 -a -s -ImageDataHash -API ImageHashType=SHA256 file.jpg
[File] ImageDataHash : 75ddf11303d38a5ae89f2f96172713f75296c28e50df18da8e9a3615797fab12
C:\>exiftool -P -overwrite_original -API ImageHashType=SHA256 -OriginalImageHashType=SHA256 "-OriginalImageHash<ImageDataHash" file.jpg
1 image files updated
C:\>exiftool -G1 -a -s -xmp-et:all file.jpg
[XMP-et] OriginalImageHash : 75ddf11303d38a5ae89f2f96172713f75296c28e50df18da8e9a3615797fab12
[XMP-et] OriginalImageHashType : SHA256
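If the hash is needed inside a script, the first command above can be wrapped with subprocess. This is only a rough sketch: the function names parse_hash_line and image_data_hash are made up for illustration, it assumes exiftool 12.58 or later is on the PATH, and it assumes the output looks like the sample line shown above.

```python
import subprocess


def parse_hash_line(line):
    """Return the hash from a line like '[File] ImageDataHash : 75dd...', else None."""
    name, sep, value = line.partition(":")
    if not sep or "ImageDataHash" not in name:
        return None
    return value.strip()


def image_data_hash(path, hash_type="SHA256", exiftool="exiftool"):
    """Run exiftool with the options from the example above and return the hash.

    Hypothetical helper: assumes exiftool >= 12.58 is installed.
    """
    result = subprocess.run(
        [exiftool, "-G1", "-a", "-s", "-ImageDataHash",
         "-API", "ImageHashType=" + hash_type, path],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        value = parse_hash_line(line)
        if value:
            return value
    return None  # exiftool produced no hash for this file type
```

Calling image_data_hash("file.jpg") should then return the same string that exiftool prints on the command line.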
Upvotes: 0
Reputation: 43495
It is much easier to use the Python Imaging Library (nowadays Pillow) to extract the picture data (example in IPython):
In [1]: from PIL import Image
In [2]: import hashlib
In [3]: im = Image.open('foo.jpg')
In [4]: hashlib.md5(im.tobytes()).hexdigest()
Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'
This works on any type of image that PIL can handle. The tobytes method returns a bytes object containing the decoded pixel data.
BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:
In [6]: hashlib.sha512(im.tobytes()).hexdigest()
Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'
On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:
#!/usr/bin/env python3
from PIL import Image
import hashlib
import sys
im = Image.open(sys.argv[1])
print(hashlib.sha512(im.tobytes()).hexdigest(), end="")
For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
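One thing to be aware of: the bytes returned by tobytes carry no dimensions or mode, so a 2x1 and a 1x2 image with identical pixel values would hash the same. A possible variation, sketched below, mixes the mode and size into the digest. The function names and the "mode:WxH:" prefix are my own choices for illustration, not anything defined by PIL, and hashes produced this way will differ from the plain tobytes hashes above.

```python
import hashlib


def pixel_digest(mode, size, raw, algorithm="sha512"):
    """Hash raw pixel bytes together with the image mode and dimensions."""
    h = hashlib.new(algorithm)
    # Hypothetical prefix format; any unambiguous encoding would do.
    h.update("{}:{}x{}:".format(mode, size[0], size[1]).encode("ascii"))
    h.update(raw)
    return h.hexdigest()


def image_digest(path, algorithm="sha512"):
    """Open an image with Pillow and hash its decoded pixels."""
    from PIL import Image  # imported here so pixel_digest works without Pillow

    with Image.open(path) as im:
        return pixel_digest(im.mode, im.size, im.tobytes(), algorithm)
```

Usage would be image_digest("foo.jpg"), or image_digest("foo.jpg", "md5") if MD5 is required.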
Upvotes: 24
Reputation: 9943
You can use stream which is part of the ImageMagick suite:
$ stream -map rgb -storage-type short image.tif - | sha256sum
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64 -
or
$ sha256sum <(stream -map rgb -storage-type short image.tif -)
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64 /dev/fd/63
This example is for a TIFF file which is RGB with 16 bits per sample (i.e. 48 bits per pixel). So I use -map rgb and a short storage-type (you can use char here if the RGB values are 8 bits).
This method reports the same signature hash that the verbose ImageMagick identify command reports:
$ identify -verbose image.tif | grep signature
signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64
(This is for ImageMagick v6.x; the hash reported by identify on version 7 is different from that obtained using stream, but the latter may be reproduced by any tool capable of extracting the raw bitmap data, such as dcraw for some image types.)
Upvotes: 4
Reputation: 3000
I would use a metadata stripper to preprocess the files before hashing.
The ImageMagick package provides mogrify, which overwrites the file in place, so run it on a copy:
mogrify -strip blah.jpg
and if you do
identify -list format
it apparently works with all the cited formats.
Upvotes: 1
Reputation: 32497
One simple way to do it is to hash the core image data. For PNG, you could do this by counting only the "critical chunks" (i.e. the ones starting with capital letters). JPEG has a similar but simpler file structure.
The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.
This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)
import struct
import os
import hashlib

def png(fh):
    # Hash only the compressed pixel data stored in the IDAT chunks.
    hash = hashlib.md5()
    assert fh.read(8)[1:4] == b"PNG"
    while True:
        try:
            length, = struct.unpack(">I", fh.read(4))
        except struct.error:
            break  # ran out of data: end of file
        if fh.read(4) == b"IDAT":
            hash.update(fh.read(length))
            fh.read(4)  # CRC
        else:
            fh.seek(length + 4, os.SEEK_CUR)  # skip chunk data and CRC
    print("Hash: %s" % hash.hexdigest())

def jpeg(fh):
    # Skip every segment before the start-of-scan marker, then hash the rest.
    hash = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # Start of scan
            hash.update(fh.read())
            break
        else:
            fh.seek(length - 2, os.SEEK_CUR)  # length includes its own 2 bytes
    print("Hash: %s" % hash.hexdigest())

if __name__ == '__main__':
    with open("sample.png", "rb") as fh:
        png(fh)
    with open("sample.jpg", "rb") as fh:
        jpeg(fh)
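The script above only hashes IDAT chunks. If you want to follow the "critical chunks" idea more literally, something like the sketch below could work. It treats a chunk as critical when the first letter of its type is uppercase (bit 5 of the first byte is clear), so ancillary chunks such as tEXt or eXIf are skipped. The function names are my own, and whether to include the chunk type in the hash is an arbitrary choice I made here.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(fh):
    """Yield (chunk_type, chunk_data) for each chunk in a PNG file object."""
    if fh.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    while True:
        header = fh.read(8)
        if len(header) < 8:
            return  # end of file
        length, ctype = struct.unpack(">I4s", header)
        data = fh.read(length)
        fh.read(4)  # CRC, not verified in this sketch
        yield ctype, data


def png_critical_hash(fh, algorithm="md5"):
    """Hash the type and data of every critical chunk, ignoring ancillary ones."""
    h = hashlib.new(algorithm)
    for ctype, data in iter_png_chunks(fh):
        if not ctype[0] & 0x20:  # uppercase first letter: critical chunk
            h.update(ctype)
            h.update(data)
    return h.hexdigest()
```

Usage is the same as before: open the file in binary mode and pass the file object, e.g. png_critical_hash(open("sample.png", "rb")). Two files that differ only in their text or EXIF chunks should then give the same hash.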
Upvotes: 8