user1469729

Reputation: 119

Fast calculation of the size and counting all files in a directory and subdirectories using python (cross platform)

How can I quickly calculate the size of a large directory, while counting all its files, in cross-platform Python? This is my current code, but it is very slow with large file counts (over 100,000):

import os

filescount = 0
totalsize = 0

class filecounter:
    def count(self, scandir):
        global filescount
        global totalsize
        if scandir[-1] not in ('/', '\\'):
            scandir = scandir + '/'
        try:
            for item in os.listdir(scandir):
                if os.path.isdir(scandir + item):
                    filecounter().count(scandir + item)
                else:
                    totalsize = totalsize + os.path.getsize(scandir + item)
                    filescount = filescount + 1
        except (WindowsError, IOError):
            pass

The globals are needed because the totals are accumulated across recursive calls.

Upvotes: 2

Views: 1694

Answers (2)

Sylvain Defresne

Reputation: 44463

If you want to write portable code for file navigation, you should consider using the functions and constants from the os module (os.path.join, os.sep, os.altsep, ...).

One way you can optimise your code is to remove the recursion and the global variables by using the os.walk function, but it is not going to gain you much: you will still be limited by the speed of your computer's I/O.

import os

def count(directory):
    totalsize = 0
    filecount = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            try:
                totalsize += os.path.getsize(os.path.join(dirpath, filename))
                filecount += 1
            except OSError:
                pass
    return totalsize, filecount

Most of the time is going to be spent on syscalls: getting the list of files in a directory, and getting the size of each particular file. You could probably use Python threads to parallelise the calls to os.stat (called indirectly by os.path.getsize). For once, Python threads would actually help, as they release the GIL while performing a syscall.
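A rough sketch of that idea using concurrent.futures (in the standard library since Python 3.2; available on Python 2 as the `futures` backport). The function name, the worker count, and the choice to treat unreadable files as size 0 are illustrative, not part of the answer above:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def count_threaded(directory, workers=8):
    # Collect all file paths first (a single-threaded walk),
    # then stat them in parallel.
    paths = []
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))

    def safe_getsize(path):
        # Treat unreadable or vanished files as size 0.
        try:
            return os.path.getsize(path)
        except OSError:
            return 0

    # Threads release the GIL while blocked in the stat() syscall,
    # so the getsize calls can genuinely overlap.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        totalsize = sum(pool.map(safe_getsize, paths))
    return totalsize, len(paths)
```

Whether this wins in practice depends on the filesystem: on a cold spinning disk the stat calls are seek-bound and overlap well; on a warm cache the thread overhead may dominate.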

Upvotes: 2

Robᵩ

Reputation: 168606

The documentation for os.walk has almost precisely the sample you are asking for:

# from http://docs.python.org/2/library/os.html
import os
from os.path import join, getsize
for root, dirs, files in os.walk('python/Lib/email'):
    print root, "consumes",
    print sum(getsize(join(root, name)) for name in files),
    print "bytes in", len(files), "non-directory files"
    if 'CVS' in dirs:
        dirs.remove('CVS')  # don't visit CVS directories

Changing it to meet your needs is fairly simple:

import os
from os.path import join, getsize
size = 0
count = 0
for root, dirs, files in os.walk('.'):
    size += sum(getsize(join(root, name)) for name in files)
    count += len(files)
print count, size
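On later Python versions (3.5+), os.scandir (PEP 471) can make this noticeably faster, because each directory entry carries cached stat information, avoiding a separate stat() call per file on most platforms. A sketch of that variant, using an explicit stack instead of recursion (the function name and error handling are my own choices):

```python
import os

def count_scandir(directory):
    # Iterative traversal with os.scandir; entry.stat() reuses the
    # metadata fetched while listing the directory where possible.
    totalsize = 0
    filecount = 0
    stack = [directory]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        try:
                            totalsize += entry.stat(follow_symlinks=False).st_size
                            filecount += 1
                        except OSError:
                            pass  # file vanished or is unreadable
        except OSError:
            pass  # directory vanished or is unreadable
    return totalsize, filecount
```

Since Python 3.5, os.walk itself is implemented on top of os.scandir, so even the os.walk version above benefits on a modern interpreter.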

Upvotes: 3
