Reputation: 25
Python 2.7.5 Win/Mac.
I'm trying to find the best way to search for files (over 10,000 of them) across multiple storage units (about 128 TiB in total). These files have specific extensions, and I can ignore some folders.
Here is my first function, using os.listdir and recursion:
import os

count = 0

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for i in pathList:
        subPath = path + os.path.sep + i
        if os.path.isfile(subPath):
            fileName = os.path.basename(subPath)
            extension = fileName[fileName.rfind("."):]
            if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                count += 1
                # do stuff . . .
        elif os.path.isdir(subPath):
            if "UselessFolder1" not in subPath and "UselessFolder2" not in subPath:
                SearchFiles1(subPath)
It works, but I think it could be better (faster and cleaner). Or am I wrong?
So I tried os.walk:
def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for i in dirpath:
            if "UselessFolder1" not in i and "UselessFolder2" not in i:
                for y in files:
                    fileName = os.path.basename(y)
                    extension = fileName[fileName.rfind("."):]
                    if ".ext1" in extension or ".ext2" in extension or ".ext3" in extension:
                        count += 1
                        # do stuff . . .
    return count
"count" is wrong and a way slower. And I think I don't really understand how path.walk
works.
My question is: what can I do to optimize this research?
Upvotes: 0
Views: 334
Reputation: 25
So after testing and discussion with tdelaney, I optimized both solutions as follows:
import os

count = 0
target_files = set((".ext1", ".ext2", ".ext3"))  # etc.
useless_dirs = set(("UselessFolder1", "UselessFolder2"))  # etc.
# It could be target_dirs instead; just swap `in` for `not in` in the comparison.

def SearchFiles1(path):
    global count
    pathList = os.listdir(path)
    for content in pathList:
        fullPath = os.path.join(path, content)
        if os.path.isfile(fullPath):
            if os.path.splitext(fullPath)[1] in target_files:
                count += 1
                # do stuff with fullPath . . .
        elif os.path.isdir(fullPath):
            if content not in useless_dirs:  # compare the folder name, not the full path
                SearchFiles1(fullPath)
def SearchFiles2(path):
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        for filename in [name for name in files if os.path.splitext(name)[1] in target_files]:
            count += 1
            fullPath = os.path.join(dirpath, filename)
            # do stuff with fullPath . . .
    return count
It works fine on Mac and PC with Python 2.7.5.
Speed-wise, the two are dead even.
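For reference, here is a rough way to compare them (a minimal sketch, assuming the two functions above are already defined; the test path is hypothetical, and the OS cache makes whichever run goes second look faster, so alternate the order over several runs):

import time

root = "/mnt/storage1"  # hypothetical test path

start = time.time()
SearchFiles1(root)  # updates the global `count`
print("SearchFiles1: %.2fs" % (time.time() - start))

start = time.time()
SearchFiles2(root)  # returns its own count
print("SearchFiles2: %.2fs" % (time.time() - start))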
Upvotes: 0
Reputation: 77347
Your first solution is reasonable, except that you could use os.path.splitext. The second solution is incorrect because you revisit the files list for each subdirectory instead of processing it once. With os.walk, the trick is that directories removed from subdirs are not part of the next round of enumeration.
def SearchFiles2(path):
    useless_dirs = set(("UselessFolder1", "UselessFolder2"))
    useless_files = set((".ext1", ".ext2"))
    count = 0
    for dirpath, subdirs, files in os.walk(path):
        # remove unwanted subdirs from future enumeration
        for name in set(subdirs) & useless_dirs:
            subdirs.remove(name)
        # list of interesting files
        myfiles = [os.path.join(dirpath, name) for name in files
                   if os.path.splitext(name)[1] not in useless_files]
        count += len(myfiles)
        for filepath in myfiles:
            # example shows file stats
            print(filepath, os.stat(filepath))
    return count
Enumerating the file system of a single storage unit can only go so fast. The best way to speed this up is to run the enumeration of different storage units in different threads.
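A minimal sketch of that idea, reusing the SearchFiles2 above; the mount points are hypothetical, one per storage unit:

import threading

# Hypothetical mount points, one per storage unit -- adjust to your setup.
mount_points = ["/mnt/storage1", "/mnt/storage2", "/mnt/storage3"]

results = {}
results_lock = threading.Lock()

def worker(root):
    # Each thread enumerates one storage unit with the os.walk-based function.
    n = SearchFiles2(root)
    with results_lock:
        results[root] = n

threads = [threading.Thread(target=worker, args=(mp,)) for mp in mount_points]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results.values()))  # total matches across all storage units

Since the enumeration is I/O-bound, the GIL is not a bottleneck here: each thread spends most of its time waiting on the file system, so the storage units are effectively scanned in parallel.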
Upvotes: 1