ensnare

Reputation: 42093

Efficiently recurse through directory of files while minimizing memory usage in Python

I have a large collection of files that I'd like to recurse through, computing an md5 checksum for each.

Many of these files are stored on multiple physical disks but all mounted in the same directory:

/mnt/drive1/dir1/file.jpg
/mnt/drive2/dir1/file2.jpg    

How can I recurse through /mnt without loading the entire directory and file structure into memory?

Is there a way to do this with multiple threads? Recursing through the directories probably doesn't need multiple threads/processes, but the file operations can be CPU intensive, which would benefit from multiple CPU cores.

Thanks in advance.

Upvotes: 2

Views: 2544

Answers (4)

jaime

Reputation: 2344

import multiprocessing
import os.path
import hashlib
import sys


VALID_EXTENSIONS = ('.JPG', '.GIF', '.JPEG')
MAX_FILE_SZ = 1000000


def md5_file(fname):
    try:
        with open(fname, 'rb') as fo:  # binary mode so the digest is correct for binary files
            m = hashlib.md5()
            chunk_sz = m.block_size * 128
            data = fo.read(chunk_sz)
            while data:
                m.update(data)
                data = fo.read(chunk_sz)
        md5_file.queue.put((fname, m.hexdigest()))
    except IOError:
        md5_file.queue.put((fname, None))


def is_valid_file(fname):
    ext = os.path.splitext(fname)[1].upper()
    fsz = os.path.getsize(fname)
    return ext in VALID_EXTENSIONS and fsz <= MAX_FILE_SZ


def init(queue):
    md5_file.queue = queue


def main():
    # Holds tuple (fname, md5sum) / md5sum will be none if an IOError occurs
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(None, init, [queue])

    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        # Convert filenames to full paths...
        full_path_fnames = map(lambda fn: os.path.join(dirpath, fn), 
                               filenames)
        full_path_fnames = filter(is_valid_file, full_path_fnames)
        pool.map(md5_file, full_path_fnames)

    # Dump the queue
    while not queue.empty():
        print queue.get()
    return 0

if __name__ == '__main__':
    sys.exit(main())

It may not be bulletproof, but it works for me. You'll probably want to tweak it to provide some feedback about what it's doing.

You can't simply pass a shared queue to the pool workers: a multiprocessing.Queue can only be shared with worker processes at the time they are created (by inheritance), not sent along later as a task argument, so I had to hand it to the workers through the pool's initializer function instead.

Just pass the root directory to process as the sole argument, and it will dump out md5 sums when it is finished.
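
For example, assuming the script is saved as md5_pool.py (a name chosen here purely for illustration), a run over the mount point from the question would look like:

python md5_pool.py /mnt

Each line of output is a (filename, md5) tuple, with None in place of the digest for any file that raised an IOError.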

Upvotes: 1

Sneftel

Reputation: 41503

Expanding on my comment a little: this code creates a pool of processes (size given on the command line), opens each file under the current directory, compresses it with zlib 50 times over, and computes the CRC of the result. It then XORs all the CRCs together and prints the combined value.

import multiprocessing
import os
import sys
import zlib

NUM_PROCS = int(sys.argv[1])

def processFile(filepath):
    # read the whole file in binary mode, compress it 50 times over,
    # then return the CRC of the final compressed blob
    with open(filepath, 'rb') as infile:
        contents = infile.read()
    for i in xrange(50):
        contents = zlib.compress(contents)
    return zlib.crc32(contents)

def generateFilepaths():
    for (dirpath, dirnames, filenames) in os.walk('.'):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            yield filepath

if __name__ == '__main__':
    pool = multiprocessing.Pool(NUM_PROCS)

    fullCrc = 0
    for crc in pool.imap_unordered(processFile, generateFilepaths()):
        fullCrc ^= crc

    print fullCrc
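
Assuming this is saved as crc_pool.py (the name is just for illustration), running it from the directory you want to scan with four worker processes would look like:

python crc_pool.py 4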

Note that without doing such an absurd amount of work on each file, this program would still be thoroughly IO-bound. It's fairly likely, therefore, that parallelism (threads or processes) will gain you very little speed.

Upvotes: 2

Dennis Sakva

Reputation: 1467

Why not use plain os.walk? It doesn't take up any significant amount of memory, since it yields one directory's contents at a time.

import os
for root, dirs, files in os.walk('/mnt'):
    for name in files:
        print os.path.join(root, name)
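
For the md5 part of the question, the same walk can be extended into a minimal single-process sketch that keeps memory usage low by hashing each file in fixed-size chunks (the 128 KB chunk size and the IOError handling are choices made here for illustration):

import hashlib
import os

for root, dirs, files in os.walk('/mnt'):
    for name in files:
        path = os.path.join(root, name)
        m = hashlib.md5()
        try:
            with open(path, 'rb') as f:
                # hash in fixed-size chunks so a large file never sits in memory whole
                for chunk in iter(lambda: f.read(128 * 1024), ''):
                    m.update(chunk)
        except IOError:
            continue  # skip unreadable files
        print path, m.hexdigest()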

Upvotes: 2

Pavel

Reputation: 7562

I'm not sure there's a universally correct answer, but you probably want to start with something simple, benchmark it, and see where the bottlenecks are.

Option 1

Generate a file list with find /mnt -type f and pass it to the script. The hashing can then be parallelized nicely across multiple worker processes (multiprocessing), but you may want to split the list according to the physical devices so that each device is handled by one worker, as in the sketch below. This is probably important if the attached storage is slow.
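
A minimal sketch of option 1, assuming the two-level /mnt/driveN/... layout from the question and a file list produced by find /mnt -type f (the script and helper names here are made up for illustration, and there is no error handling):

import hashlib
import multiprocessing
import sys

def md5_files(paths):
    # hash one device's worth of files sequentially, in fixed-size chunks
    results = []
    for path in paths:
        m = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(128 * 1024), ''):
                m.update(chunk)
        results.append((path, m.hexdigest()))
    return results

if __name__ == '__main__':
    # group paths from the find output by physical device, e.g. /mnt/drive1
    groups = {}
    with open(sys.argv[1]) as listing:
        for line in listing:
            path = line.rstrip('\n')
            device = '/'.join(path.split('/')[:3])
            groups.setdefault(device, []).append(path)

    # one worker per device so each drive is read by a single process
    pool = multiprocessing.Pool(len(groups) or 1)
    for results in pool.map(md5_files, groups.values()):
        for path, digest in results:
            print path, digest

Run it as python md5_by_device.py filelist.txt, where filelist.txt is the output of the find command above.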

Option 2

Generate a list of top-level directories (e.g. at max depth 1 or 2) and let the workers operate on each partition separately. This can be done with Linux's find -type d and/or Python's os.walk().


I wouldn't worry about loading too much into memory until you know exactly how bad it is, that is, until you benchmark. Also, I don't think that os.walk or a plain file list is going to be a serious memory issue. Correct me if I'm wrong.

Upvotes: 0
