Reputation: 738
How can I efficiently calculate the size of every subfolder and file in a given directory?
The code I have so far does what I want, but it is inefficient and slow because of how I have to calculate the parent folder size.
Here are my current timings:
Section 1: 0.53 s
Section 2: 30.71 s
Code:
import os
import time
import collections

def folder_size(directory):
    parents = []
    file_size = collections.defaultdict(int)
    parent_size = collections.defaultdict(int)

    t0 = time.time()
    #### Section 1 ####
    for root, dirs, files in os.walk(directory):
        root = os.path.abspath(root)
        parents.append(root)
        for f in files:
            f = os.path.join(root, f)
            file_size[f] += os.path.getsize(f)
    ###################
    t1 = time.time()
    print(f'walk time: {round(t1-t0, 2)}')

    #### Section 2 ####
    for parent in parents:
        parent_split = parent.split(os.sep)
        for filename, value in file_size.items():
            parent_for_file = filename.split(os.sep)[:len(parent_split)]
            if parent_split == parent_for_file:
                parent_size[parent] += value
    ###################
    t2 = time.time()
    print(f'parent size time: {round(t2-t1, 2)}')

    return file_size, parent_size
Section 2 of the code is inefficient for a couple of reasons:
Inefficiency #1
I need to capture folders that have no files directly in them, only subfolders. For example, in a folder structure like this:
TopFolder
├── FolderA
│   ├── folder_P1
│   │   ├── folder_P1__file_1.txt
│   │   └── folder_P1__file_2.txt
│   ├── folder_P10
│   │   ├── folder_P10__file_1.txt
│   │   └── folder_P10__file_2.txt
.
.
.
I want to end up with a size (in bytes) for each directory, like this:
'..../TopFolder': 114000,
'..../TopFolder/FolderA': 38000,
'..../TopFolder/FolderA/folder_P1': 38,
'..../TopFolder/FolderA/folder_P10': 38,
.
.
.
In order to get the total size for folders that have subfolders, like TopFolder and FolderA, I stored the parents separately, so I could go back and calculate their size based on the file sizes.
Inefficiency #2
The code is really slow because I have to split() the strings to determine the parent (confirmed with the cProfile module). I have to do this because if I do something like the snippet below, certain folder sizes are calculated incorrectly: the path ending in folder_P1 is a string prefix of the paths under folder_P10, folder_P100, and folder_P1000, so their files get counted toward folder_P1 as well (there's a minimal demonstration of this after the two outputs below). I also tried using re.split(), but that's even slower.
#### Section 2 ####
...
for parent in parents:
    for filename, value in file_size.items():
        if parent in filename:
            parent_size[parent] += value
...
###################
Here's the wrong output with if parent in filename:
'..../TopFolder': 114000,
'..../TopFolder/FolderA': 38000,
'..../TopFolder/FolderA/folder_P1': 4256,
'..../TopFolder/FolderA/folder_P10': 456,
'..../TopFolder/FolderA/folder_P100': 76,
'..../TopFolder/FolderA/folder_P1000': 38,
.
.
.
Here's the correct output with the original code:
'..../TopFolder': 114000,
'..../TopFolder/FolderA': 38000,
'..../TopFolder/FolderA/folder_P1': 38,
'..../TopFolder/FolderA/folder_P10': 38,
'..../TopFolder/FolderA/folder_P100': 38,
'..../TopFolder/FolderA/folder_P1000': 38,
.
.
.
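To make the mismatch concrete, here's a tiny standalone check of the substring test (the paths just mirror the sample structure; it isn't part of my actual code):

import os

# Folder whose size I am trying to total up.
parent = os.path.join('TopFolder', 'FolderA', 'folder_P1')
# A file that lives under a *different* folder, folder_P10.
other_file = os.path.join('TopFolder', 'FolderA', 'folder_P10', 'folder_P10__file_1.txt')

# The naive test matches because 'folder_P1' is a string prefix of
# 'folder_P10', so folder_P10's files get added to folder_P1's total.
print(parent in other_file)  # True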
Section 2 either needs to be improved so it runs faster, or it needs to be incorporated into Section 1. I've searched the internet for ideas, but have only been able to find information on calculating the size of the top-level directory, and I'm running out of ideas.
Here's the code I used to create a sample directory structure:
import os

folder = 'TopFolder'
subfolders = ['FolderA', 'FolderB', 'FolderC']
for i in range(1000):
    for subfolder in subfolders:
        path = os.path.join(folder, subfolder, f'folder_P{i + 1}')
        if not os.path.isdir(path):
            os.makedirs(path)
        for k in range(2):
            with open(os.path.join(path, f'folder_P{i + 1}__file_{k + 1}.txt'), 'w') as file_out:
                file_out.write(f'Hello from file {k + 1}!\n')
Upvotes: 2
Views: 873
Reputation: 107124
With os.walk you don't get to use the file entry objects generated by os.scandir, which os.walk calls internally. Write a recursive function yourself with os.scandir, so you can use the stat object of each file entry rather than having to make a separate system call with os.path.getsize for each file. You also shouldn't parse each file path just to find its parent directory, since you already know the parent directory's path at the moment you list that directory.
The following example takes only 0.2 seconds to produce the desired output for your test directory structure on repl.it:
import os

def folder_size(directory):
    def _folder_size(directory):
        total = 0
        for entry in os.scandir(directory):
            if entry.is_dir():
                # Recurse first so the subdirectory's total is available,
                # then roll it up into this directory's total.
                _folder_size(entry.path)
                total += parent_size[entry.path]
            else:
                # Use the DirEntry's stat information instead of a
                # separate os.path.getsize call for each file.
                size = entry.stat().st_size
                total += size
                file_size[entry.path] = size
        parent_size[directory] = total

    file_size = {}
    parent_size = {}
    _folder_size(directory)
    return file_size, parent_size

file_size, parent_size = folder_size('TopFolder')
Demo: https://replit.com/@blhsing/SparseStainedNature
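If you want to spot-check the result against the sample structure from the question, printing a few of the aggregated totals is enough (just a quick usage sketch; the keys are the paths built by the function above):

# Total for the whole tree, then for one branch and one leaf folder.
print(parent_size['TopFolder'])
print(parent_size[os.path.join('TopFolder', 'FolderA')])
print(parent_size[os.path.join('TopFolder', 'FolderA', 'folder_P1')])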
Upvotes: 3