Reputation: 42093
I have a large collection of files that I'd like to recurse through and compute an MD5 checksum for each.
Many of these files are stored on multiple physical disks but all mounted in the same directory:
/mnt/drive1/dir1/file.jpg
/mnt/drive2/dir1/file2.jpg
How can I recurse through /mnt without loading the entire directory and file structure into memory?
Is there a way to do this with multiple threads? It might not be necessary to recurse through the directories using multiple threads/processes, but the file operations can be CPU-intensive, which would benefit from multiple CPU cores.
Thanks in advance.
Upvotes: 2
Views: 2544
Reputation: 2344
import multiprocessing
import os.path
import hashlib
import sys

VALID_EXTENSIONS = ('.JPG', '.GIF', '.JPEG')
MAX_FILE_SZ = 1000000


def md5_file(fname):
    try:
        # Read in binary mode and hash in chunks so large files
        # never have to fit in memory all at once.
        with open(fname, 'rb') as fo:
            m = hashlib.md5()
            chunk_sz = m.block_size * 128
            data = fo.read(chunk_sz)
            while data:
                m.update(data)
                data = fo.read(chunk_sz)
        md5_file.queue.put((fname, m.hexdigest()))
    except IOError:
        md5_file.queue.put((fname, None))


def is_valid_file(fname):
    ext = os.path.splitext(fname)[1].upper()
    fsz = os.path.getsize(fname)
    return ext in VALID_EXTENSIONS and fsz <= MAX_FILE_SZ


def init(queue):
    # Give each worker process a reference to the shared queue.
    md5_file.queue = queue


def main():
    # Holds tuples (fname, md5sum); md5sum will be None if an IOError occurs.
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(None, init, [queue])

    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        # Convert filenames to full paths...
        full_path_fnames = map(lambda fn: os.path.join(dirpath, fn),
                               filenames)
        full_path_fnames = filter(is_valid_file, full_path_fnames)
        pool.map(md5_file, full_path_fnames)

    # Dump the queue
    while not queue.empty():
        print queue.get()

    return 0

if __name__ == '__main__':
    sys.exit(main())
May not be bulletproof, but it works for me. You'll probably want to tweak it to provide some feedback as to what it is doing.
For some odd reason, you can't share a global queue. So, I had to use the pool's initializer function. I'm not sure why this is the case.
Just pass the root directory to process as the sole argument, and it will dump out md5 sums when it is finished.
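If you do want per-file feedback as it runs, a slightly different sketch is to have the worker return its result and consume pool.imap_unordered, which yields each result as soon as a worker finishes and also sidesteps the queue-sharing issue. The helpers md5_file_result and walk_files below are just illustrative names, and the extension/size filtering is omitted for brevity:

import hashlib
import multiprocessing
import os
import sys

def md5_file_result(fname):
    # Hash one file and return (fname, hexdigest); digest is None on IOError.
    try:
        m = hashlib.md5()
        with open(fname, 'rb') as fo:
            for chunk in iter(lambda: fo.read(128 * m.block_size), b''):
                m.update(chunk)
        return fname, m.hexdigest()
    except IOError:
        return fname, None

def walk_files(root):
    # Lazily yield full paths, one directory at a time.
    for dirpath, dirnames, filenames in os.walk(root):
        for fn in filenames:
            yield os.path.join(dirpath, fn)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    # imap_unordered yields each (fname, digest) as soon as a worker finishes,
    # so you get immediate feedback instead of waiting for the whole tree.
    for fname, digest in pool.imap_unordered(md5_file_result, walk_files(sys.argv[1])):
        print('%s  %s' % (digest, fname))
    pool.close()
    pool.join()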
Upvotes: 1
Reputation: 41503
Expanding on my comment a little bit, this code creates a pool of processes (size given on the command line) that opens each file under the current directory, compresses it 50 times with zlib, and computes the CRC of the result. It then XORs all the CRCs together and prints the combined value.
import multiprocessing
import os
import sys
import zlib

NUM_PROCS = int(sys.argv[1])


def processFile(filepath):
    # Read the whole file, compress it 50 times to simulate CPU-heavy
    # work, and return the CRC of the result.
    infile = open(filepath, 'rb')
    contents = infile.read()
    infile.close()
    for i in xrange(50):
        contents = zlib.compress(contents)
    return zlib.crc32(contents)


def generateFilepaths():
    # Lazily yield every file path under the current directory.
    for (dirpath, dirnames, filenames) in os.walk('.'):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            yield filepath


if __name__ == '__main__':
    pool = multiprocessing.Pool(NUM_PROCS)

    fullCrc = 0
    for crc in pool.imap_unordered(processFile, generateFilepaths()):
        fullCrc ^= crc

    print fullCrc
Note that without such an absurd amount of work on each file, this program would be thoroughly IO-bound, so it's fairly likely that threading would gain you very little speed.
Upvotes: 2
Reputation: 1467
Why not simply use os.walk? It doesn't take any significant amount of memory, since it yields one directory's contents at a time.
import os

for root, dirs, files in os.walk('/mnt'):
    for name in files:
        print os.path.join(root, name)
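To tie this back to the original question, a minimal single-process sketch that hashes each file during the walk could look like the following; reading in fixed-size chunks keeps memory usage flat even for large files:

import hashlib
import os

for root, dirs, files in os.walk('/mnt'):
    for name in files:
        path = os.path.join(root, name)
        m = hashlib.md5()
        try:
            with open(path, 'rb') as fo:
                # Hash in fixed-size chunks so the whole file is never in memory.
                for chunk in iter(lambda: fo.read(1 << 16), b''):
                    m.update(chunk)
            print('%s  %s' % (m.hexdigest(), path))
        except IOError:
            print('could not read %s' % path)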
Upvotes: 2
Reputation: 7562
Not sure if there's a universally correct answer, but perhaps you want to start with something simple, benchmark it, and see where the bottlenecks are.
Option 1
Generate a file list with find /mnt -type f and pass it to the script. This parallelizes well with multiple worker processes (multiprocessing), but you might want to split the list according to the physical devices so that each device is handled by one worker. This is probably important if the attached storage is slow.
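A rough sketch of that split, assuming each physical device corresponds to the first path component under /mnt (drive1, drive2, ...) and that the file list arrives on stdin; split_by_device, hash_device and md5sum are just illustrative names:

import hashlib
import multiprocessing
import os
import sys

def md5sum(path):
    # Chunked read keeps memory usage small regardless of file size.
    m = hashlib.md5()
    with open(path, 'rb') as fo:
        for chunk in iter(lambda: fo.read(1 << 16), b''):
            m.update(chunk)
    return path, m.hexdigest()

def hash_device(paths):
    # One worker handles all files of one physical device sequentially,
    # so each disk only ever serves a single reader.
    return [md5sum(p) for p in paths]

def split_by_device(paths, root='/mnt'):
    # Group paths by their first component under /mnt, e.g. 'drive1', 'drive2'.
    groups = {}
    for p in paths:
        device = os.path.relpath(p, root).split(os.sep)[0]
        groups.setdefault(device, []).append(p)
    return groups

if __name__ == '__main__':
    # Feed this the output of: find /mnt -type f
    paths = [line.strip() for line in sys.stdin if line.strip()]
    groups = split_by_device(paths)
    pool = multiprocessing.Pool(max(1, len(groups)))
    for results in pool.imap_unordered(hash_device, groups.values()):
        for path, digest in results:
            print('%s  %s' % (digest, path))
    pool.close()
    pool.join()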
Option 2
Generate a list of top-level directories (e.g. max-depth 1 or 2) and let the workers operate on each partition separately. This can be done with Linux's find -type d and/or Python's os.walk().
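A rough sketch of that partitioning, assuming the top-level directories under the given root correspond to the separate drives; hash_subtree is just an illustrative name:

import hashlib
import multiprocessing
import os
import sys

def hash_subtree(top):
    # Each worker walks and hashes its own top-level directory.
    results = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            m = hashlib.md5()
            try:
                with open(path, 'rb') as fo:
                    for chunk in iter(lambda: fo.read(1 << 16), b''):
                        m.update(chunk)
                results.append((path, m.hexdigest()))
            except IOError:
                results.append((path, None))
    return results

if __name__ == '__main__':
    root = sys.argv[1]
    # Roughly the equivalent of: find "$root" -maxdepth 1 -type d (minus root itself)
    tops = [os.path.join(root, d) for d in os.listdir(root)
            if os.path.isdir(os.path.join(root, d))]
    pool = multiprocessing.Pool(max(1, len(tops)))
    for results in pool.imap_unordered(hash_subtree, tops):
        for path, digest in results:
            print('%s  %s' % (digest, path))
    pool.close()
    pool.join()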
I wouldn't worry about loading too much into memory until you know exactly how bad it is, that is, until you benchmark. Also, I don't think that os.walk or a full file list is going to be a serious memory issue. Correct me if I'm wrong.
Upvotes: 0