Reputation: 56199
I am very new to Python and have a question. How can I check in Python whether two files (a string and a file) have the same content? I need to download some files and rename them, but I don't want to save the same content under two or more different names (the same content can be available at different IP addresses).
Upvotes: 2
Views: 4936
Reputation: 683
Hashes and checksums are great for comparing a list of files. However, if you are only comparing two specific files and don't have a pre-computed hash/checksum for either, it is faster to compare the two files directly than to compute a hash/checksum for each and then compare those.
def equalsFile(firstFile, secondFile, blocksize=65536):
    buf1 = firstFile.read(blocksize)
    buf2 = secondFile.read(blocksize)
    # Loop while either file still has data, so a longer second file
    # is detected rather than silently ignored.
    while buf1 or buf2:
        if buf1 != buf2:
            return False
        buf1, buf2 = firstFile.read(blocksize), secondFile.read(blocksize)
    return True
In my tests, 64 md5 checks on two 50MB files complete in 24.468 seconds, while 64 direct comparisons complete in just 4.770 seconds. This method also has the advantage of instantly returning false upon finding any difference, while calculating the hash must continue to read the entire file.
An additional way to fail early on files that aren't identical is to compare their sizes before running the test above, using os.path.getsize(filename). A size difference is very common between two files with different content, so this should always be the first thing you check.
import os

if os.path.getsize('file1.txt') != os.path.getsize('file2.txt'):
    print(False)
else:
    with open('file1.txt', 'rb') as f1, open('file2.txt', 'rb') as f2:
        print(equalsFile(f1, f2))
Upvotes: 1
Reputation: 172249
For each file you download make a hash or a checksum. Keep a list of these hashes/checksums.
Then before saving the downloaded data to disk, check if the hash/checksum already exists in the list, and if it does, don't save it, but if it doesn't, save the file and add the checksum/hash to the list.
Pseudocode:
checksums = []
for url in all_urls:
    data = download_file(url)
    checksum = make_checksum(data)
    if checksum not in checksums:
        save_to_file(data)
        checksums.append(checksum)
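The pseudocode above can be sketched as runnable Python by operating on in-memory blobs instead of actual downloads (`download_file` and `save_to_file` in the pseudocode are placeholders, so this sketch just collects the unique blobs):

```python
import hashlib

def dedupe(blobs):
    """Return only the blobs whose content has not been seen before."""
    seen = set()            # set membership test is O(1), unlike a list
    unique = []
    for data in blobs:
        checksum = hashlib.md5(data).hexdigest()
        if checksum not in seen:
            seen.add(checksum)
            unique.append(data)
    return unique

# Two "downloads" with identical content are kept only once.
print(dedupe([b"foo", b"bar", b"foo"]))  # [b'foo', b'bar']
```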
Upvotes: 1
Reputation: 176780
It is not necessary to use a cryptographic hash if all you want is a checksum; Python has one in the binascii module.
binascii.crc32(data[, crc])
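A minimal sketch of using it on downloaded bytes (keep in mind that CRC32 is fast but far more collision-prone than a cryptographic hash, so unequal content can in principle share a checksum):

```python
import binascii

data1 = b"some downloaded content"
data2 = b"some downloaded content"
data3 = b"different content"

# Identical content always yields the same checksum.
print(binascii.crc32(data1) == binascii.crc32(data2))  # True
# Different content almost always yields a different checksum.
print(binascii.crc32(data1) == binascii.crc32(data3))
```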
Upvotes: 2
Reputation: 3716
If the file is large, I would consider reading it in chunks like this:
compare.py:
import hashlib

teststr = "foo"
filename = "file.txt"

def md5_for_file(f, block_size=2**20):
    """Hash a file-like object opened in binary mode, one block at a time."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

# Hash the string, plus the trailing newline the file will contain.
md5 = hashlib.md5()
md5.update((teststr + "\n").encode('utf8'))
digest = md5.digest()

with open(filename, 'rb') as f:  # binary mode: hash the raw bytes
    print(md5_for_file(f) == digest)
file.txt:
foo
This program prints True if the string and the file have the same content.
Upvotes: 5
Reputation: 1997
The best way is to compute a hash (e.g. MD5) of each file and compare the hashes. The usual approach is to read the file in blocks and feed each block to hashlib.md5, so the whole file never has to fit in memory.
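That approach boils down to something like the following sketch (`md5_of_file` is a hypothetical name; the 64 KiB block size is an arbitrary choice):

```python
import hashlib

def md5_of_file(path, block_size=65536):
    """Return the hex MD5 digest of a file, reading it in blocks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:  # binary mode: hash the raw bytes
        while True:
            block = f.read(block_size)
            if not block:
                break
            md5.update(block)
    return md5.hexdigest()
```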
Upvotes: 1
Reputation: 49013
Use sha1 hash of file content.
#!/usr/bin/env python
from __future__ import with_statement
from __future__ import print_function
from hashlib import sha1

def shafile(filename):
    with open(filename, "rb") as f:
        return sha1(f.read()).hexdigest()

if __name__ == '__main__':
    import sys
    import glob

    globber = (filename for arg in sys.argv[1:] for filename in glob.glob(arg))
    for filename in globber:
        print(filename, shafile(filename))
This program takes wildcards on the command line, but it is just for demonstration purposes.
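For the downloading use case in the question, the same idea can be extended to group files whose content is identical (a sketch; `group_by_hash` is a hypothetical name, and the file hashing matches `shafile` above):

```python
from collections import defaultdict
from hashlib import sha1

def group_by_hash(filenames):
    """Map each SHA-1 digest seen more than once to the files sharing it."""
    groups = defaultdict(list)
    for name in filenames:
        with open(name, "rb") as f:
            groups[sha1(f.read()).hexdigest()].append(name)
    # Any group with more than one name holds files with identical content.
    return {h: names for h, names in groups.items() if len(names) > 1}
```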
Upvotes: 5