Sivaram Chintalapudi

Reputation: 466

Why MD5 hash values are different for two excel files which appear the same

I have two Excel files saved at different locations. One is downloaded directly from the browser and the other is downloaded using the Selenium driver. I manually checked both files and they are exactly the same. But the MD5 hash values generated for the two files are different. How do I fix this issue?

Upvotes: 5

Views: 9602

Answers (4)

ptmalcolm

Reputation: 656

Old post, new perspective:

TL;DR - The ZIP specification includes a timestamp. See the Wikipedia entry on the ZIP file format. The following sequence will answer the question "Do my two spreadsheets actually contain the same data?"

unzip file1.xlsx -d dir1/
unzip file2.xlsx -d dir2/
diff -rq dir1/ dir2/

If the diff command at the end comes up empty, your spreadsheets are the same, despite the different hashes of the two different files.

The accepted answer from alvonellos is correct about hashing. MD5 hashing will, almost certainly, give you different results for files that differ in any way, even by a single bit. In the 10 years since the original question, MD5 has been deprecated in favor of more cryptographically secure hashes, but it is still generally fine for the OP's use case -- validating files on your local filesystem. Accidental collisions are astronomically unlikely, and files that are very similar but not identical will almost always produce completely different hashes. Deliberately crafting two files with the same MD5 hash is feasible (which is why MD5 is deprecated for security purposes), but it requires carefully engineered changes throughout one of the files, not the incidental differences at issue here. If you don't trust MD5, you can use any flavor of SHA or another hashing algorithm and you'll get similar results.
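The byte-level hashing described above can be sketched in a few lines of Python using only the standard library (the file paths in the comment are placeholders):

```python
import hashlib

def file_digest(path, algo="md5", chunk_size=65536):
    """Hash a file's raw bytes in chunks; any one-bit change alters the digest."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Swapping MD5 for SHA-256 needs no other change:
# file_digest("file1.xlsx", "sha256")
```

Because the whole byte stream feeds the hash, the ZIP-internal timestamps discussed above are enough to change the result.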

Deep dive into .XLSX:

The .xlsx format is just a .zip archive under the hood. You can use the Linux unzip utility to decompress an .xlsx:

unzip file.xlsx -d dir/

Previous responses suggest calculating a diff on the two files, but have not described the best way to do this. Once you have used unzip on the .xlsx file, you will have a directory structure with the "guts" of your spreadsheet:

dir/
    [Content_Types].xml
    _rels/
          .rels
    xl/
          workbook.xml
          worksheets/
                      sheet1.xml
                      sheet2.xml
                      sheet3.xml
 . . .

Once you have done this to two different spreadsheets, say file1.xlsx expanded to dir1/ and file2.xlsx expanded to dir2/, you can do a recursive diff on the two directories:

diff -rq dir1/ dir2/ # <-- -r: recurse into subdirectories, -q: report only which files differ

Note that, if what you really want to know is whether the two files have different content, then this command will answer the question. If there is no output from this command, then there is no difference in content between the directories, i.e., there is no difference between the two original spreadsheets' content.
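If you would rather do the recursive comparison in code than shell out to diff, Python's standard filecmp module can replicate the diff -rq check. A sketch (the directory paths are whatever you unzipped to):

```python
import filecmp

def dirs_identical(dir1, dir2):
    """Recursively compare two directory trees by content (like diff -rq)."""
    cmp = filecmp.dircmp(dir1, dir2)
    if cmp.left_only or cmp.right_only or cmp.funny_files:
        return False
    # Re-compare common files with shallow=False so byte content, not
    # os.stat() metadata, decides whether they match.
    _, mismatch, errors = filecmp.cmpfiles(dir1, dir2, cmp.common_files, shallow=False)
    if mismatch or errors:
        return False
    return all(dirs_identical(sub.left, sub.right) for sub in cmp.subdirs.values())
```

The shallow=False re-check matters: dircmp's default comparison trusts file size and modification time, which is exactly the metadata we are trying to ignore.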

If you are curious about the differences in the .xlsx files themselves, you can dig into the bits of the headers with the linux xxd utility:

xxd file1.xlsx | head -n1 # <-- look at the first line (16 bytes)
00000000: 504b 0304 1400 0000 0800 acab a354 7d3d  PK...........T}=

xxd file2.xlsx | head -n1
00000000: 504b 0304 1400 0000 0800 66ac a354 7d3d  PK........f..T}=

The modification time shows up in the sixth two-byte group (in this example, acab and 66ac respectively). The modification date is in the seventh two-byte group (in this example, a354 for both).

Keep in mind that a .XLSX file is just a .ZIP file with a set of directories and files that follow Microsoft's standard zipped up inside of it. Each one of the contained files has its own CRC-32 hash.

The eighth and ninth two-byte groups (byte offsets 14-17) contain a CRC-32 hash that was generated by whatever zip utility created the file. So, if you have the xxd utility handy, you can skip all the unzipping steps mentioned above and simply do:

xxd -s 14 -l 4 file1.xlsx
xxd -s 14 -l 4 file2.xlsx

With output that will look something like this:

xxd -s 14 -l 4 file1.xlsx
0000000e: 7d3d 6d31                                }=m1

xxd -s 14 -l 4 file2.xlsx
0000000e: 7d3d 6d31                                }=m1

thus confirming that the two internal files have the same hashes (regardless of timestamp). This is a handy check for the very first file contained within the .ZIP (ie .XLSX).
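If you would rather decode these fields programmatically than count byte groups in xxd output, the fixed layout of the local file header can be unpacked with Python's struct module. A sketch based on the ZIP local-file-header layout (the path is a placeholder):

```python
import struct

def first_local_header(path):
    """Decode mod time, mod date, and CRC-32 from the first ZIP local file header.
    Per the ZIP spec the fields are little-endian: signature (4 bytes), version
    needed (2), flags (2), method (2), mod time (2), mod date (2), CRC-32 (4)."""
    with open(path, "rb") as f:
        header = f.read(18)
    sig, _ver, _flags, _method, mod_time, mod_date, crc = struct.unpack("<IHHHHHI", header)
    assert sig == 0x04034B50  # "PK\x03\x04" read as a little-endian uint32
    return mod_time, mod_date, crc
```

Comparing the crc values of two .xlsx files is the same check as the xxd commands above, without the manual offset arithmetic.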

For an exhaustive view of the CRC-32 hashes of the contents of all of the files contained within the .XLSX archive, you can use the following approach (shown here as runnable Python rather than pseudocode):

import sys

# "PK\x01\x02" marks each central-directory record at the end of the .ZIP
sig = bytes.fromhex("504b0102")

with open(sys.argv[1], "rb") as f:
    file_content = f.read()

chunks = file_content.split(sig)  # chunks[0] is everything before the first record

# Within each record body (the split strips the 4-byte signature), the
# CRC-32 field sits at byte offset 12 (offset 16 from the record start).
crc32_hashes = [chunk[12:16].hex() for chunk in chunks[1:]]

The magic number 504b0102 is the signature that begins each file summary (central directory record) at the end of the .ZIP file. The ZIP specification gives the value as 0x02014b50, which appears in the file as the byte sequence 50 4b 01 02 ("PK\x01\x02") because ZIP fields are stored little-endian.

The resulting crc32_hashes array contains the CRC-32 hashes of each of the files contained therein. Again, because of internal timestamp mechanisms and other implementation-specific metadata in the internal XMLs, the human-readable parts of a spreadsheet could be identical but the CRC-32 hashes different.

Nevertheless, this is an inexpensive way to get a "fingerprint" of two Excel files to find out if they are in fact two copies of the exact same .XLSX. It's just string manipulation, which is much less time and processor intensive than re-hashing. It is relying on the hashing that was already done at the moment the .XLSX file was created.
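For what it's worth, Python's standard zipfile module already parses the central directory, so the same fingerprint can be collected without any manual byte slicing (a sketch; the file names are placeholders):

```python
import zipfile

def xlsx_crc_fingerprint(path):
    """Map each archive member to the CRC-32 recorded in the central directory.
    Nothing is decompressed; this reuses the hashes written at zip time."""
    with zipfile.ZipFile(path) as zf:
        return {info.filename: info.CRC for info in zf.infolist()}

# Two .xlsx files carry byte-identical members iff their fingerprints match:
# xlsx_crc_fingerprint("file1.xlsx") == xlsx_crc_fingerprint("file2.xlsx")
```

This is as cheap as the string-splitting approach, and it also pairs each CRC with the member's filename, so you can see exactly which internal file differs.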

Upvotes: 0

Eli Hatch

Reputation: 1

Are you sure metadata is included in the hash? It would be wise to do some research on this.

If this were true you would never find matching hashes, because the likelihood of timestamps matching would be very low. You can also change the filename a thousand times and the hash will stay the same. Also, when an AV scans a file it changes the accessed-timestamp properties, so on a machine that is constantly being scanned by an AV your hashes would be changing constantly if filesystem metadata were included in the hash.

Upvotes: 0

alvonellos

Reputation: 1062

MD5 is a hashing function. People use hashing functions to verify the integrity of a file, stream, or other resource. When it comes to hashing functions, when you're verifying the integrity of a file, you're verifying that at the bit level, the files are the same.

The ramifications of this are that when you're comparing a file with integrity constraints on the bitwise level, then a hashing function works perfectly.

However, given the nature of Excel spreadsheets, if so much as one bit is added, removed, or moved in the document, then the hash of that file will be completely different.

Since the driver for Excel is quite different from the driver that selenium uses, especially given compression and other alterations/optimizations that may be made to the file by selenium, then -- of course -- the hash is going to be different.

My recommendations: Firstly, pull up the two files in diff and find out what is different between them. If the hashes of two files differ, then the files are guaranteed to differ somewhere at the byte level.

Secondly: write a driver that compares the information in those spreadsheets to verify integrity (and you can take hashes of that information) of the document, rather than verifying the files on a bitwise level.

I'd recommend exporting both as CSV and going line by line to compare the two.

Upvotes: 4

Neozaru

Reputation: 1130

The MD5 algorithm hashes the entire file, including metadata (filenames, dates, etc.) that is stored inside the file itself, so two files can be identical in "main content" but still differ in some bytes.

It can be hard to determine which parts of the file are actually relevant for an MD5 check.

If you are on Windows and interested only in Excel files, try a tool like this one: http://www.formulasoft.com/download.html

Upvotes: 3
