JOE SKEET

Reputation: 8118

Python - Acquiring a count of file extensions across all directories

We have a hard drive with hundreds of thousands of files.

I need to figure out how many files of each extension we have.

How can I do this with Python?

It needs to go through every directory. The lawyers at my company need this. A total for the entire hard drive is fine; it does not have to be broken down by directory.

Example:

1232 JPEG
11 exe
45 bat
2342 avi
532 doc

Upvotes: 3

Views: 2149

Answers (5)

Senthil Kumaran

Reputation: 56941

Have a look at the os.walk call in the os module to traverse the entire directory tree. Get the extension using os.path.splitext. Maintain a dictionary whose key is extension.lower(), and increment the count for each extension you encounter.

import os
import collections

# Maps each lowercased extension (including the dot) to the number of files seen
extensions = collections.defaultdict(int)

for path, dirs, files in os.walk('/'):
    for filename in files:
        extensions[os.path.splitext(filename)[1].lower()] += 1

for key, value in extensions.items():
    print('Extension:', key, value, 'items')
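Since this walks '/', some directories will probably be unreadable, and os.walk skips those silently by default. If the totals need to be defensible, a minimal sketch like the one below records what was skipped by using the onerror parameter of os.walk. The function name count_extensions_logging_errors and the helper remember are hypothetical, not part of the answer above.

```python
import os
from collections import defaultdict


def count_extensions_logging_errors(top):
    """Count extensions under top and report directories that could not be listed."""
    counts = defaultdict(int)
    skipped = []  # paths of directories os.walk failed to list

    def remember(error):
        # os.walk calls this with the OSError raised while listing a directory
        skipped.append(error.filename)

    for _path, _dirs, files in os.walk(top, onerror=remember):
        for filename in files:
            counts[os.path.splitext(filename)[1].lower()] += 1
    return counts, skipped
```

The second return value lists the directories that were left out of the count, so they can be rechecked with different permissions.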

Upvotes: 10

karlcow

Reputation: 6972

The pattern is simple.

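import os

YourPath = '.'       # placeholder: the directory to scan
EXTENSION = '.jpeg'  # placeholder: the single extension to count
counter = 0
for root, dirs, files in os.walk(YourPath):
    for file in files:
        if file.endswith(EXTENSION):
            counter += 1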

To count several extensions, you could keep a list of EXTENSION values and repeat this for each one. A faster way is to build a dictionary as you go, using each extension as a key and its count as the value: {jpeg: 1232, exe: 11}

Update: Many of the solutions proposed here assume that the extension string correctly represents the file type, but I'm not sure there is any other way of doing it. As the comment below says, the iteration should be done only once, so it is better to build the dictionary as you go.
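A rough sketch of the "build the dictionary as you go" idea, using a plain dict and a single pass over the tree. The function name count_extensions is made up for illustration, and stripping the dot and lowercasing the key are my assumptions, chosen to match the {jpeg: 1232, exe: 11} shape above.

```python
import os


def count_extensions(root):
    """Walk root once and return a dict such as {'jpeg': 2, 'exe': 1}."""
    counts = {}
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext('photo.JPEG') gives ('photo', '.JPEG'); drop the dot, fold case
            ext = os.path.splitext(name)[1].lstrip('.').lower()
            # Start at 0 the first time an extension is seen, then add one
            counts[ext] = counts.get(ext, 0) + 1
    return counts
```

Files without an extension end up under the empty-string key, so they are still counted.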

Upvotes: 1

chmullig

Reputation: 13416

Use os.walk() to go through the files, and os.path.splitext() to get just the extensions. You may want to lower() the extensions too, because at least in my $HOME I have a bunch of .jpg and a bunch of .JPG.

import os, os.path, collections
extensionCount = collections.defaultdict(int)
for root, dirs, files in os.walk('.'):
    for file in files:
        base, ext = os.path.splitext(file)
        extensionCount[ext.lower()] += 1
# Now print them out, largest to smallest.
for ext, count in sorted(extensionCount.items(), key=lambda x: x[1], reverse=True):
    print(ext, count)
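As an alternative to sorting the items by hand, collections.Counter has a most_common() method that returns the pairs largest first. Here is a hedged sketch that also formats the lines like the example in the question ("1232 JPEG"). The names extension_counts and format_report are hypothetical, and the "(none)" label for files without an extension is my own choice.

```python
import os
from collections import Counter


def extension_counts(root):
    """Return a Counter mapping lowercased extension (with dot) to file count."""
    return Counter(
        os.path.splitext(name)[1].lower()
        for _dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    )


def format_report(counts):
    """Build one 'count extension' line per extension, largest count first."""
    return [
        "%d %s" % (n, ext.lstrip('.') or '(none)')
        for ext, n in counts.most_common()
    ]
```

Something like print('\n'.join(format_report(extension_counts('.')))) would then print the totals for the current directory tree.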

Upvotes: 2

Rafe Kettler

Reputation: 76965

import os
from os.path import splitext

extensions = {}
for root, dir, files in os.walk('/'):
    for file in files:
        ext = splitext(file)[1]
        try:
            extensions[ext] += 1
        except KeyError:
            extensions[ext] = 1

You'd probably be better served by a collections.defaultdict; you can use that if you like.

You can then print the values like so:

for extension, count in extensions.items():
    print('Extension %s has %d files' % (extension, count))

Upvotes: 2

Piotr Duda

Reputation: 1815

A working script will be very simple, and I recommend you use the os.walk() function. It generates the file names in a directory tree (http://docs.python.org/library/os.html).
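To make concrete what os.walk() generates: it yields one (dirpath, dirnames, filenames) tuple per directory it visits. The sketch below just collects every file path it sees. The function name list_tree is made up for this example.

```python
import os


def list_tree(top):
    """Return the sorted paths, relative to top, of all files found by os.walk."""
    found = []
    for dirpath, dirnames, filenames in os.walk(top):
        # dirpath: the directory being visited
        # dirnames: names of its subdirectories (os.walk descends into these next)
        # filenames: names of the non-directory entries directly inside dirpath
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), top))
    return sorted(found)
```

Counting extensions is then just a matter of replacing the append with a dictionary update keyed on os.path.splitext(name)[1].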

Upvotes: 0
