Reputation: 466
I have two Excel files saved at different locations. One was downloaded directly from the browser and the other was downloaded using the Selenium driver. I checked both files manually and they are exactly the same, but the MD5 hash values generated for the two files are different. How can I fix this issue?
Upvotes: 5
Views: 9602
Reputation: 656
Old post, new perspective:
TL;DR - The zip specification includes a timestamp. See: the wikipedia entry on zip. The following sequence will answer the question "Do my two spreadsheets actually contain the same data?"
unzip file1.xlsx -d dir1/
unzip file2.xlsx -d dir2/
diff -rq dir1/ dir2/
If the diff command at the end comes up empty, your spreadsheets are the same, despite the different hashes of the two different files.
The accepted answer from alvonellos is correct about hashing. MD5 will, almost certainly, give you different results for files that differ in any way, even by a single bit. In the 10 years since the original question, MD5 has been deprecated in favor of more cryptographically secure hashes, because deliberate MD5 collisions can now be constructed by an attacker; but it is still generally fine for the OP's use case -- validating files on your own filesystem, where nobody is trying to fool you. Accidental collisions are astronomically unlikely, and files that are very similar but not identical will still virtually always produce different hashes. If you don't trust MD5, you can use any flavor of SHA or another hashing algorithm and you'll get the same behavior.
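The single-bit sensitivity is easy to demonstrate with Python's standard hashlib (the byte strings below are placeholders standing in for file contents):

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

a = b"identical spreadsheet content"
b = b"identical spreadsheet contenu"  # differs from a by a single byte

print(md5_of(a) == md5_of(a))  # same bytes  -> same digest:      True
print(md5_of(a) == md5_of(b))  # one byte off -> different digest: False
```

The same property is what makes the two downloaded .xlsx files hash differently: any one differing byte (a timestamp, say) changes the whole digest.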
Deep dive into .XLSX:
The .xlsx format is just a .zip archive under the hood. You can use the Linux unzip utility to decompress an .xlsx:
unzip file.xlsx -d dir/
Previous responses suggest calculating a diff on the two files, but have not described the best way to do this. Once you have used unzip on the .xlsx file, you will have a directory structure with the "guts" of your spreadsheet:
dir/
[Content_Types].xml
_rels/
.rels
xl/
workbook.xml
worksheets/
sheet1.xml
sheet2.xml
sheet3.xml
. . .
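If you prefer not to shell out, Python's standard zipfile module can produce the same listing. A self-contained sketch (it builds a tiny stand-in archive in memory; with a real spreadsheet you would open zipfile.ZipFile("file1.xlsx") directly):

```python
import io
import zipfile

# Build a minimal stand-in archive so the example runs on its own
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# namelist() returns the member paths, mirroring the unzip output above
names = zipfile.ZipFile(buf).namelist()
print(names)
# ['[Content_Types].xml', 'xl/workbook.xml', 'xl/worksheets/sheet1.xml']
```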
Once you have done this to two different spreadsheets, say file1.xlsx expanded to dir1/ and file2.xlsx expanded to dir2/, you can do a recursive diff on the two trees:
diff -rq dir1/ dir2/ # <-- -r: recurse, -q: report only which files differ
Note that, if what you really want to know is whether the two files have different content, then this command answers the question. If there is no output from this command, then there is no difference in content between the directories -- that is, no difference between the two original spreadsheets' content.
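The same recursive comparison can be done from Python with the standard filecmp module; a sketch, assuming dir1/ and dir2/ are the unzipped trees:

```python
import filecmp

def dirs_identical(d1: str, d2: str) -> bool:
    """True when two directory trees contain the same files with the same bytes."""
    cmp = filecmp.dircmp(d1, d2)
    if cmp.left_only or cmp.right_only or cmp.diff_files or cmp.funny_files:
        return False
    # shallow=False forces byte-by-byte comparison instead of stat() metadata
    _, mismatch, errors = filecmp.cmpfiles(d1, d2, cmp.common_files, shallow=False)
    if mismatch or errors:
        return False
    # Recurse into subdirectories common to both trees
    return all(dirs_identical(f"{d1}/{sub}", f"{d2}/{sub}") for sub in cmp.common_dirs)
```

dirs_identical("dir1", "dir2") then plays the role of the empty diff -rq output: True means the two spreadsheets have identical content.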
If you are curious about the differences in the .xlsx files themselves, you can dig into the bits of the headers with the Linux xxd utility:
xxd file1.xlsx | head -n1 # <-- look at the first line (16 bytes)
00000000: 504b 0304 1400 0000 0800 acab a354 7d3d PK...........T}=
xxd file2.xlsx | head -n1
00000000: 504b 0304 1400 0000 0800 66ac a354 7d3d PK........f..T}=
The time-stamp shows up in the sixth 16-bit group of the dump (in this example, acab and 66ac, respectively), i.e. file offsets 10-11. The date is in the seventh group (a354 for both), file offsets 12-13.
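Those two fields are standard MS-DOS time and date words, stored little-endian. A minimal stdlib sketch for decoding them (the header bytes below are copied from the file1.xlsx hexdump above):

```python
import struct

def dos_datetime(header: bytes):
    """Decode the mod-time/mod-date words of a zip local file header (offsets 10-13)."""
    t, d = struct.unpack_from("<HH", header, 10)  # two little-endian 16-bit words
    return (
        1980 + (d >> 9),   # year   (7 bits, 1980-based)
        (d >> 5) & 0x0F,   # month  (4 bits)
        d & 0x1F,          # day    (5 bits)
        t >> 11,           # hour   (5 bits)
        (t >> 5) & 0x3F,   # minute (6 bits)
        (t & 0x1F) * 2,    # second (5 bits, 2-second resolution)
    )

# First 14 bytes of file1.xlsx, per the hexdump above
header = b"PK\x03\x04\x14\x00\x00\x00\x08\x00\xac\xab\xa3\x54"
print(dos_datetime(header))  # -> (2022, 5, 3, 21, 29, 24)
```

This makes the cause of the differing MD5s concrete: the two downloads were written seconds apart, so these embedded words differ even though the spreadsheet content is identical.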
Keep in mind that a .XLSX file is just a .ZIP file with a set of directories and files that follow Microsoft's standard zipped up inside of it. Each one of the contained files has its own CRC-32 checksum.
The eighth and ninth 16-bit groups (file offsets 14-17) contain the CRC-32 of the first file stored in the archive, computed by whatever zip utility generated the file. So, if you have the xxd utility handy, you can skip all the unzipping steps mentioned above and simply do:
xxd -s 14 -l 4 file1.xlsx
xxd -s 14 -l 4 file2.xlsx
With output that will look something like this:
xxd -s 14 -l 4 file1.xlsx
0000000e: 7d3d 6d31 }=m1
xxd -s 14 -l 4 file2.xlsx
0000000e: 7d3d 6d31 }=m1
thus confirming that the first file stored inside each archive has the same CRC-32, regardless of timestamp. This is a handy check for the very first file contained within the .ZIP (i.e. the .XLSX).
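The same four bytes can be pulled out without xxd; a small stdlib sketch (the file name is a placeholder):

```python
def first_local_crc(path: str) -> str:
    """Hex of bytes 14-17: the CRC-32 field of the first local file header."""
    with open(path, "rb") as f:
        f.seek(14)  # local header: sig(4) ver(2) flags(2) method(2) time(2) date(2)
        return f.read(4).hex()

# first_local_crc("file1.xlsx") == first_local_crc("file2.xlsx") reproduces
# the xxd -s 14 -l 4 check above in one call per file.
```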
For an exhaustive view of the CRC-32 hashes of the contents of all of the files contained within the .XLSX archive, you can use the following algorithm, shown here as runnable Python:
data = open("file1.xlsx", "rb").read()
# 504b0102 ("PK\x01\x02") marks each central-directory file header
chunks = data.split(b"\x50\x4b\x01\x02")
crc32_hashes = []
for chunk in chunks[1:]:  # skip index 0: everything before the first header
    # the CRC-32 field sits 12 bytes after the 4-byte signature (offset 16 in the header)
    crc32_hashes.append(chunk[12:16].hex())
The magic number 504b0102 (the bytes PK\x01\x02) is the signature that precedes each file's summary record in the central directory at the end of the .ZIP file.
The resulting crc32_hashes
array contains the CRC-32 hashes of each of the files contained therein. Again, because of internal timestamp mechanisms and other implementation-specific metadata in the internal XMLs, the human-readable parts of a spreadsheet could be identical but the CRC-32 hashes different.
Nevertheless, this is an inexpensive way to get a "fingerprint" of two Excel files to find out if they are in fact two copies of the exact same .XLSX. It's just string manipulation, which is much less time- and processor-intensive than re-hashing; it relies on the hashing that was already done at the moment the .XLSX file was created.
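If Python is available, the standard zipfile module already parses the central directory for you, so the fingerprint comparison becomes one call per file; a sketch (file names are placeholders):

```python
import zipfile

def crc_fingerprint(path):
    """Sorted (member name, CRC-32) pairs read from the archive's central directory."""
    with zipfile.ZipFile(path) as zf:
        return sorted((info.filename, info.CRC) for info in zf.infolist())

# Two archives hold byte-identical content (timestamps aside) when
# crc_fingerprint("file1.xlsx") == crc_fingerprint("file2.xlsx")
```

This is the same data the manual split on 504b0102 extracts, but with the offsets handled by the library.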
Upvotes: 0
Reputation: 1
Are you sure metadata is included in the hash? It would be wise to do some research on this.
If this were true you would almost never find matching hashes, because the likelihood of timestamps matching would be very low. You can also change the filename a thousand times and the hash will stay the same. Also, when an AV scans a file it changes the last-accessed timestamp; your hashes would be changing constantly on a machine that is constantly being scanned by an AV if filesystem metadata were included in the hash.
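This point, that the file name and filesystem timestamps are not part of the digest, is easy to verify with the standard library (the file names below are made up for the demonstration):

```python
import hashlib
import os
import tempfile

content = b"same bytes in both files"
root = tempfile.mkdtemp()

# Same content under two different names (and different filesystem timestamps)
paths = [os.path.join(root, name) for name in ("report.xlsx", "renamed_copy.xlsx")]
for p in paths:
    with open(p, "wb") as f:
        f.write(content)

digests = [hashlib.md5(open(p, "rb").read()).hexdigest() for p in paths]
print(digests[0] == digests[1])  # True: only the bytes inside the file are hashed
```

For .xlsx specifically, both observations hold at once: filesystem metadata is never hashed, but the zip-internal timestamps discussed above are bytes inside the file, so they are.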
Upvotes: 0
Reputation: 1062
MD5 is a hashing function. People use hashing functions to verify the integrity of a file, stream, or other resource. When it comes to hashing functions, when you're verifying the integrity of a file, you're verifying that at the bit level, the files are the same.
The ramifications of this are that when you're comparing a file with integrity constraints on the bitwise level, then a hashing function works perfectly.
However, given the nature of Excel spreadsheets, if so much as one bit is added, removed, or moved in the document, then the hash of that file will be completely different. (Not with absolute certainty, but close enough not to worry about.)
Since the path the file takes through Excel is quite different from the one Selenium uses -- especially given compression and other alterations/optimizations that may be made to the file along the way -- then, of course, the hash is going to be different.
My recommendations:
Firstly: pull up the two files in diff and find out what is different between them. If the hashes of two files are different, then the files themselves are necessarily different; the converse is not guaranteed, since two different files can, in principle, share a hash.
Secondly: write a driver that compares the information in those spreadsheets to verify integrity (and you can take hashes of that information) of the document, rather than verifying the files on a bitwise level.
I'd recommend exporting both as CSV and comparing the two line by line.
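A hedged sketch of that line-by-line comparison, assuming both spreadsheets have already been exported to CSV (the function name and default csv parsing are assumptions, not part of the answer above):

```python
import csv

def csv_diff(path1: str, path2: str):
    """Return (row_number, row_a, row_b) triples for every row where the files differ."""
    with open(path1, newline="") as f1, open(path2, newline="") as f2:
        rows1, rows2 = list(csv.reader(f1)), list(csv.reader(f2))
    diffs = []
    for i in range(max(len(rows1), len(rows2))):
        a = rows1[i] if i < len(rows1) else None  # None marks a missing row
        b = rows2[i] if i < len(rows2) else None
        if a != b:
            diffs.append((i + 1, a, b))
    return diffs
```

An empty return value means the exported content is identical, which is exactly the integrity check on the information rather than on the bytes.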
Upvotes: 4
Reputation: 1130
The MD5 algorithm processes the file entirely, including any metadata (internal filenames, dates, etc.) that is stored inside the file, so two files can be identical in "main content" but still differ in a few bytes.
It can be hard to determine which part of the file is really interesting for an MD5 check.
If you are on Windows, and interested only in Excel files, try this kind of tool: http://www.formulasoft.com/download.html
Upvotes: 3