Reputation: 115
Using Python 2.7 and scandir, I need to traverse down all directories and subdirectories and return a list of the directories ONLY. Not the files. The depth of subdirectories along a path may vary.
I am aware of os.walk, but my directory has 2 million files and therefore os.walk is too slow for this.
Currently the code below works for me, but I suspect there may be an easier way/loop to achieve the same result, and I'd like to know how it can be improved. Also, my function is still limited by how deep into the sub-directories it can traverse, and perhaps this can be overcome.
from scandir import scandir

def list_directories(path):
    dir_list = []
    for entry in scandir(path):
        if entry.is_dir():
            dir_list.append(entry.path)
            for entry2 in scandir(entry.path):
                if entry2.is_dir():
                    dir_list.append(entry2.path)
                    for entry3 in scandir(entry2.path):
                        if entry3.is_dir():
                            dir_list.append(entry3.path)
                            for entry4 in scandir(entry3.path):
                                if entry4.is_dir():
                                    dir_list.append(entry4.path)
                                    for entry5 in scandir(entry4.path):
                                        if entry5.is_dir():
                                            dir_list.append(entry5.path)
                                            for entry6 in scandir(entry5.path):
                                                if entry6.is_dir():
                                                    dir_list.append(entry6.path)
    return dir_list
for item in list_directories(directory):
    print item
Please let me know if you have a better alternative to quickly returning all directories and sub-directories in a path that has millions of files.
Upvotes: 1
Views: 4224
Reputation: 8898
scandir provides a walk() function that includes the same optimizations as scandir(), so it should be faster than os.walk(). (scandir's background section suggests a 3-10x improvement on Linux/Mac OS X.)
So you could just use that... Something like this code might work:
from scandir import walk

def list_directories(path):
    dir_list = []
    for root, _, _ in walk(path):
        # Skip the top-level directory, same as in your original code:
        if root == path:
            continue
        dir_list.append(root)
    return dir_list
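Usage is the same as with your original function (the path below is just a placeholder):

for item in list_directories('/path/to/top/dir'):  # placeholder path
    print item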
If you want to implement this using scandir() instead, you should use recursion so it supports an arbitrary depth.
Something like:
from scandir import scandir

def list_directories(path):
    dir_list = []
    for entry in scandir(path):
        if entry.is_dir() and not entry.is_symlink():
            dir_list.append(entry.path)
            dir_list.extend(list_directories(entry.path))
    return dir_list
NOTE: I added a check for is_symlink() too, so it doesn't traverse symlinks. Otherwise a symlink pointing to '.' or '..' would make this recurse forever...
I still think using scandir.walk() is better (simpler, more reliable), so if that suits you, use that instead!
Upvotes: 4
Reputation: 365717
First, to avoid that limit of 6 directories, you probably want to do this recursively:
def list_directories(path):
    dir_list = []
    for entry in scandir(path):
        if entry.is_dir():
            dir_list.append(entry.path)
            dir_list.extend(list_directories(entry.path))
    return dir_list
Also, since you're using Python 2.7, part of the reason os.walk is too slow is that Python 2.7 uses listdir instead of scandir for walk. The scandir backport package includes its own implementation of walk (basically the same one used in Python 3.5) that provides the same API as walk but with a major speedup (especially on Windows).
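For example, swapping the backport in is just an import change; a small sketch (the function name is mine, and it falls back to os.walk if scandir isn't installed):

try:
    from scandir import walk   # C-accelerated walk from the backport
except ImportError:
    from os import walk        # plain os.walk as a fallback

def count_directories(path):
    # Same (root, dirs, files) API either way.
    total = 0
    for root, dirs, files in walk(path):
        total += len(dirs)
    return total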
Beyond that, your main performance cost probably depends on the platform.
On Windows, it's mainly the cost to read the directory entries, and there's really not much you can do about it; scandir is already doing this the fastest way.
On POSIX, it's probably mainly the cost to stat each file to see if it's a directory. You can speed this up a bit by using fts (especially on Linux), but as far as I know there are no good Python wrappers for it. If you know ctypes, it's not that complicated to just call it; the hard part is coming up with a good design for exposing all of its features to Python (which, of course, you don't need to do). See my unfinished library on GitHub if you want to try this yourself.
Alternatively, you might want to use find (which uses fts under the covers), either driving it via subprocess, or having it drive your script.
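Driving find from subprocess could look something like this (a rough sketch, assuming a Unix find on PATH; the function name is mine):

import subprocess

def list_directories_with_find(path):
    # "-type d" restricts the output to directories; find prints the
    # starting path itself first, so drop that to match the original code.
    output = subprocess.check_output(['find', path, '-type', 'd'])
    return output.splitlines()[1:]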
Finally, you may want to do things in parallel. This may actually slow things down instead of speeding things up if your filesystem is, say, an old laptop hard drive as opposed to, say, two SSDs and a RAID stripe with high-end controllers. So definitely try it and see before committing too much.
If you're doing non-trivial work, a single walk thread queuing up directories for your worker(s) to work on is probably all you need.
If walking is the whole point, you'll want to pull multiple walkers in parallel. The way concurrent.futures.ThreadPoolExecutor wraps things up may just be good enough out of the box, and it's dead simple. For maximum speed, you may need to manually queue things up and pull them in batches, shard the work by physical volume, etc., but probably none of that will be necessary. (If it is, and if you can muddle through reading Rust code, ripgrep put a lot of work into navigating a filesystem as quickly as possible.)
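If you want to experiment with that, here's a rough sketch of a pool-based walker (names are mine, not a tested implementation; on Python 2.7 it needs the futures backport in addition to scandir):

from concurrent.futures import ThreadPoolExecutor
from scandir import scandir

def _scan_one(path):
    # List one directory's entries; this runs inside a worker thread.
    return list(scandir(path))

def parallel_list_directories(path, max_workers=4):
    dir_list = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pending = [executor.submit(_scan_one, path)]
        while pending:
            future = pending.pop()
            for entry in future.result():  # blocks until that one scan is done
                if entry.is_dir() and not entry.is_symlink():
                    dir_list.append(entry.path)
                    pending.append(executor.submit(_scan_one, entry.path))
    return dir_list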
Upvotes: 0
Reputation: 701
You can use the Python built-in module os.walk:
import os

for root, dirs, files in os.walk(".", topdown=False):
    for name in files:
        print(os.path.join(root, name))
    for name in dirs:
        # This will get all directories within the path
        print(os.path.join(root, name))
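Since you only want the directories, you can skip the files loop entirely and collect the paths into a list; something like:

import os

def list_directories(path):
    # Collect directory paths only; the files list is ignored.
    dir_list = []
    for root, dirs, files in os.walk(path):
        for name in dirs:
            dir_list.append(os.path.join(root, name))
    return dir_list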
For more info visit this link: os.walk
Upvotes: -1