Simon1

Reputation: 738

How to efficiently calculate folder and file sizes in a directory?

How can I efficiently calculate the size of every subfolder and file in a given directory?

The code I have so far does what I want, but it is inefficient and slow because of how I have to calculate the parent folder size.

Here's my current timing:

Section 1: 0.53 s
Section 2: 30.71 s

Code:

import os
import time
import collections

def folder_size(directory):
    parents = []
    file_size = collections.defaultdict(int)
    parent_size = collections.defaultdict(int)

    t0 = time.time()

    #### Section 1 ####
    for root, dirs, files in os.walk(directory):
        root = os.path.abspath(root)
        parents.append(root)
        
        for f in files:
            f = os.path.join(root, f)
            file_size[f] += os.path.getsize(f)
    ###################

    t1 = time.time()
    print(f'walk time: {round(t1-t0, 2)}')   

    #### Section 2 ####
    for parent in parents:
        parent_split = parent.split(os.sep)
        for filename, value in file_size.items():
            parent_for_file = filename.split(os.sep)[:len(parent_split)]
            if parent_split == parent_for_file:
                parent_size[parent] += value
    ###################
    
    t2 = time.time()
    print(f'parent size time: {round(t2-t1, 2)}')   

    return file_size, parent_size

Section 2 of the code is inefficient for a couple of reasons:

Inefficiency #1

I need to capture folders where there are no files. For example, in a folder structure like this:

TopFolder
├── FolderA
│   ├── folder_P1
│   │   ├── folder_P1__file_1.txt
│   │   └── folder_P1__file_2.txt
│   ├── folder_P10
│   │   ├── folder_P10__file_1.txt
│   │   └── folder_P10__file_2.txt
.
.
.

I want to end up with a size (in bytes) for each directory, like this:

'..../TopFolder': 114000,
'..../TopFolder/FolderA': 38000,
'..../TopFolder/FolderA/folder_P1': 38,
'..../TopFolder/FolderA/folder_P10': 38,
.
.
.

In order to get the total size for folders that have subfolders, like TopFolder and FolderA, I stored the parents separately, so I could go back and calculate their size based on the file sizes.

Inefficiency #2

The code is really slow because of the split() calls used to determine each file's parent (confirmed with the cProfile module). I have to split because if I do something simpler, like the snippet below, certain folder sizes are calculated incorrectly. I also tried re.split(), but that's even slower.

#### Section 2 ####
    ...
    for parent in parents:
        for filename, value in file_size.items():
            if parent in filename:
                parent_size[parent] += value
    ...
###################

Here's the wrong output with if parent in filename:

'..../TopFolder': 114000,
'..../TopFolder/FolderA': 38000,
'..../TopFolder/FolderA/folder_P1': 4256,
'..../TopFolder/FolderA/folder_P10': 456,
'..../TopFolder/FolderA/folder_P100': 76,
'..../TopFolder/FolderA/folder_P1000': 38,
.
.
.

Here's the correct output with the original code:

'..../TopFolder': 114000,
'..../TopFolder/FolderA': 38000,
'..../TopFolder/FolderA/folder_P1': 38,
'..../TopFolder/FolderA/folder_P10': 38,
'..../TopFolder/FolderA/folder_P100': 38,
'..../TopFolder/FolderA/folder_P1000': 38,
.
.
.
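
For what it's worth, the false matches in the `if parent in filename` snippet come from plain substring containment: files under folder_P10, folder_P100, and folder_P1000 all contain the substring ".../folder_P1". A prefix check with an explicit trailing separator rules those out while keeping a cheap string comparison (a sketch of the idea; it fixes the correctness problem but still leaves Section 2 as a nested loop):

```python
import os

def is_inside(parent, filename):
    # Appending the separator means '/x/folder_P1' no longer matches
    # files under '/x/folder_P10' or '/x/folder_P100', while still
    # matching files in any depth of subfolder below the parent.
    return filename.startswith(parent + os.sep)

print(is_inside('/x/folder_P1', '/x/folder_P1/file_1.txt'))   # True
print(is_inside('/x/folder_P1', '/x/folder_P10/file_1.txt'))  # False
```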

Section 2 either needs to run faster, or it needs to be incorporated into Section 1. I've searched the internet for ideas, but have only been able to find info on calculating the size of the top-level directory, and I'm running out of ideas.

Here's the code I used to create a sample directory structure:

import os

folder = 'TopFolder'
subfolders = ['FolderA', 'FolderB', 'FolderC']

for i in range(1000):
    for subfolder in subfolders:
        path = os.path.join(folder, subfolder, f'folder_P{i + 1}')
        if not os.path.isdir(path):
            os.makedirs(path)
        for k in range(2):
            with open(os.path.join(path, f'folder_P{i + 1}__file_{k + 1}.txt'), 'w') as file_out:
                file_out.write(f'Hello from file {k + 1}!\n')

Upvotes: 2

Views: 873

Answers (1)

blhsing

Reputation: 107124

With os.walk you don't get access to the directory entry objects produced by os.scandir, which os.walk calls internally. Instead, write a recursive function with os.scandir yourself, so you can use each file entry's stat object rather than making a separate system call with os.path.getsize for every file. You also shouldn't parse paths just to find the parent directory name: you already know the parent, because it is the directory you are currently listing.

The following example takes only 0.2 seconds to produce the desired output for your test directory structure on repl.it:

import os

def folder_size(directory):
    def _folder_size(directory):
        total = 0
        for entry in os.scandir(directory):
            if entry.is_dir():
                # Recurse first, so the subfolder's total is available.
                _folder_size(entry.path)
                total += parent_size[entry.path]
            else:
                # entry.stat() uses the entry returned by scandir instead
                # of a separate os.path.getsize call per file.
                size = entry.stat().st_size
                total += size
                file_size[entry.path] = size
        parent_size[directory] = total

    file_size = {}
    parent_size = {}
    _folder_size(directory)
    return file_size, parent_size

file_size, parent_size = folder_size('TopFolder')

Demo: https://replit.com/@blhsing/SparseStainedNature
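
A quick self-check on a throwaway tree (the function is repeated here so the snippet runs standalone; the folder and file names are illustrative):

```python
import os
import tempfile

def folder_size(directory):
    # Same function as above, repeated so this snippet is self-contained.
    def _folder_size(directory):
        total = 0
        for entry in os.scandir(directory):
            if entry.is_dir():
                _folder_size(entry.path)
                total += parent_size[entry.path]
            else:
                size = entry.stat().st_size
                total += size
                file_size[entry.path] = size
        parent_size[directory] = total

    file_size = {}
    parent_size = {}
    _folder_size(directory)
    return file_size, parent_size

with tempfile.TemporaryDirectory() as top:
    sub = os.path.join(top, 'FolderA', 'folder_P1')
    os.makedirs(sub)
    path = os.path.join(sub, 'file_1.txt')
    with open(path, 'wb') as f:
        f.write(b'hello!')  # 6 bytes

    file_size, parent_size = folder_size(top)
    assert file_size[path] == 6
    assert parent_size[sub] == 6
    assert parent_size[os.path.join(top, 'FolderA')] == 6
    assert parent_size[top] == 6
```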

Upvotes: 3
