ensnare

Reputation: 42033

Compute hash of only the core image data (excluding metadata) for an image

I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.

In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.

How can I determine where in the file the tag is located?

The images that I am scanning are in the formats TIFF, JPG, PNG, BMP, DNG, CR2, and NEF, plus some videos in MOV, AVI, and MPG.

Upvotes: 29

Views: 11167

Answers (5)

StarGeek

Reputation: 5771

Starting with version 12.58 (Mar. 15, 2023), exiftool can generate an MD5, SHA256, or SHA512 hash of the image data that ignores the embedded metadata.

The hash can be generated for JPEG, TIFF, PNG, CRW, CR3, MRW, RAF, X3F, IIQ, JP2, JXL, HEIC and AVIF images, MOV/MP4 videos, and some RIFF-based files such as AVI, WAV and WEBP.

The tag name is ImageDataHash (originally called ImageDataMD5) and the hash algorithm can be changed with the -API ImageHashType option. To avoid performance issues, it is only generated if requested on the command line.

Exiftool also provides tags to store the hash value and type in the file: XMP-et:OriginalImageHash and XMP-et:OriginalImageHashType.

Example usage:

C:\>exiftool -G1 -a -s -ImageDataHash -API ImageHashType=SHA256 file.jpg 
[File]          ImageDataHash                   : 75ddf11303d38a5ae89f2f96172713f75296c28e50df18da8e9a3615797fab12

C:\>exiftool -P -overwrite_original -API ImageHashType=SHA256 -OriginalImageHashType=SHA256 "-OriginalImageHash<ImageDataHash" file.jpg 
    1 image files updated

C:\>exiftool -G1 -a -s -xmp-et:all file.jpg 
[XMP-et]        OriginalImageHash               : 75ddf11303d38a5ae89f2f96172713f75296c28e50df18da8e9a3615797fab12
[XMP-et]        OriginalImageHashType           : SHA256

Upvotes: 0

Roland Smith

Reputation: 43495

It is much easier to use the Python Imaging Library (PIL, nowadays Pillow) to extract the picture data (example in IPython):

In [1]: from PIL import Image

In [2]: import hashlib

In [3]: im = Image.open('foo.jpg')

In [4]: hashlib.md5(im.tobytes()).hexdigest()
Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'

This works on any type of image that PIL can handle. The tobytes method returns a bytes object containing the pixel data.

BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:

In [6]: hashlib.sha512(im.tobytes()).hexdigest()
Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'

On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:

#!/usr/bin/env python3

from PIL import Image
import hashlib
import sys

im = Image.open(sys.argv[1])
print(hashlib.sha512(im.tobytes()).hexdigest(), end="")

For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
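
One caveat with hashing tobytes() alone: the byte string carries no dimensions or mode, so a 2x3 and a 3x2 image with the same bytes would hash identically. A minimal sketch that mixes the mode and size into the digest follows. The helper name pixel_digest is hypothetical, and it only assumes that the image object exposes the mode, size and tobytes() members that a Pillow image has.

```python
import hashlib


def pixel_digest(im, algorithm="sha512"):
    """Hash an image's mode, size and raw pixel bytes (hypothetical helper).

    `im` is assumed to be a Pillow Image, or anything exposing
    .mode (str), .size ((width, height)) and .tobytes().
    """
    h = hashlib.new(algorithm)
    width, height = im.size
    # Prefix the pixel data with a small header so that images sharing
    # the same bytes but differing in shape or mode hash differently.
    h.update(f"{im.mode}:{width}x{height}:".encode("ascii"))
    h.update(im.tobytes())
    return h.hexdigest()
```

With Pillow installed, this could be called as pixel_digest(Image.open('foo.jpg')). The resulting digest is not comparable with a plain hash of tobytes(), since the header is part of the hashed data.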

Upvotes: 24

starfry

Reputation: 9943

You can use stream which is part of the ImageMagick suite:

$ stream -map rgb -storage-type short image.tif - | sha256sum
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  -

or

$ sha256sum <(stream -map rgb -storage-type short image.tif -)
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  /dev/fd/63

This example is for a TIFF file which is RGB with 16 bits per sample (i.e. 48 bits per pixel), so I map to rgb and use a short storage-type (you can use char here if the RGB values are 8-bit).

This method reports the same signature hash that the verbose ImageMagick identify command reports:

$ identify -verbose image.tif | grep signature
signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64

(This holds for ImageMagick v6.x. The hash reported by identify on version 7 differs from the one obtained using stream, but the latter can be reproduced by any tool capable of extracting the raw bitmap data, such as dcraw for some image types.)

Upvotes: 4

gbin

Reputation: 3000

I would use a metadata stripper to preprocess the files before hashing.

The ImageMagick package provides:

mogrify -strip blah.jpg

Note that mogrify overwrites the file in place, so run it on a copy. If you run

identify -list format 

it apparently works with all the formats you listed.

Upvotes: 1

Krumelur

Reputation: 32497

One simple way to do it is to hash the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose names start with a capital letter). JPEG has a similar but simpler file structure.

The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.

This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)

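import struct
import os
import hashlib

def png(fh):
    """Hash only the IDAT (compressed pixel data) chunks of a PNG."""
    md5 = hashlib.md5()
    assert fh.read(8)[1:4] == b"PNG"
    while True:
        try:
            # Chunk length: 4-byte big-endian unsigned integer
            length, = struct.unpack(">I", fh.read(4))
        except struct.error:
            break  # end of file
        if fh.read(4) == b"IDAT":
            md5.update(fh.read(length))
            fh.read(4)  # CRC
        else:
            # Skip chunk data plus CRC
            fh.seek(length + 4, os.SEEK_CUR)
    print("Hash: %s" % md5.hexdigest())

def jpeg(fh):
    """Hash everything from the start-of-scan segment to the end of a JPEG."""
    md5 = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"  # SOI marker
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # Start of scan
            # Hashes the rest of the file, including anything after the image data
            md5.update(fh.read())
            break
        else:
            # The length field counts its own two bytes
            fh.seek(length - 2, os.SEEK_CUR)
    print("Hash: %s" % md5.hexdigest())


if __name__ == '__main__':
    # Files must be opened in binary mode
    with open("sample.png", "rb") as fh:
        png(fh)
    with open("sample.jpg", "rb") as fh:
        jpeg(fh)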
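
The png() function above hashes only the IDAT chunks. A variant closer to the "critical chunks" idea would hash every chunk whose type starts with an uppercase letter (IHDR, PLTE, IDAT, IEND), which also covers the image dimensions and palette. The following is a minimal sketch under that assumption; the function name critical_chunk_digest is hypothetical, and chunk CRCs are skipped rather than verified.

```python
import hashlib
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def critical_chunk_digest(fh, algorithm="sha256"):
    """Hash type and data of every critical PNG chunk (hypothetical helper).

    `fh` is a binary file object positioned at the start of the file.
    """
    if fh.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    h = hashlib.new(algorithm)
    while True:
        header = fh.read(8)
        if len(header) < 8:
            break  # end of file
        # 4-byte big-endian length followed by the 4-byte chunk type
        length, ctype = struct.unpack(">I4s", header)
        data = fh.read(length)
        fh.read(4)  # skip the CRC (not verified in this sketch)
        # Bit 5 of the first type byte is clear (uppercase) for critical chunks
        if not ctype[0] & 0x20:
            h.update(ctype)
            h.update(data)
        if ctype == b"IEND":
            break
    return h.hexdigest()
```

Ancillary chunks such as tEXt or eXIf are ignored, so adding or editing them leaves the digest unchanged. Like the script above, this hashes the compressed stream, so re-saving the image with different compression settings changes the digest even when the pixels are identical.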

Upvotes: 8
