lemiant

Reputation: 4365

Why is os.mkdir() slower when called explicitly?

I have been working on a project that has to create a large directory structure. My first solution was to keep a dict of all the directories that exist and, whenever the code comes across one that has not been made, use os.makedirs() to create it and any missing intermediate directories. When I profiled this code I found that the vast majority of the time (105 out of 132 seconds) was spent calling posix.stat() to determine that intermediate directories did not exist. However, I am building the entire structure in an empty directory, so I already know that none of the intermediate directories exist.

To take advantage of this, I wrote a version of the code that keeps an internal memo describing the structure of the directory tree, so that it can determine which directories have been created without querying the OS:

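import os

class DirTree:
    def __init__(self, root):
        self.root = os.path.abspath(root)
        self.tree = {}  # nested dicts mirroring the directories created so far

    def makedirs(self, path):
        # Normalize to forward slashes so the split below also works on Windows.
        relpath = os.path.relpath(path, self.root).replace('\\', '/')
        built = self.root
        node = self.tree
        for directory in relpath.split('/'):
            built = os.path.join(built, directory)
            if directory in node:
                # Already created earlier; descend without touching the OS.
                node = node[directory]
            else:
                node[directory] = {}
                node = node[directory]
                os.mkdir(built, 0o777)  # 0o777 is valid in Python 2.6+ and 3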

This code does run faster. However, when I run it through the profiler, the same 4068 calls to os.mkdir() now take 4 times longer (94s instead of 24s). I don't understand why this function takes longer when it is called from my function than when it is called by os.makedirs(). Does anybody have an idea why?

Upvotes: 3

Views: 1822

Answers (1)

waTeim

Reputation: 9235

You are right that os.makedirs checks the existence of each path component before making the directory (see here, line 136). Both your code and os.makedirs use the CPython module posixmodule.c for the actual implementation of mkdir, which on Linux resolves to the mkdir system call.

It looks like os.makedirs really does stat unnecessarily, given how time-consuming stat is: if "a" doesn't exist, then "a/b" certainly doesn't exist either.
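As a rough illustration of avoiding the stat calls altogether, here is a minimal sketch that walks the path from the top down and simply attempts each mkdir, ignoring "already exists" errors. The name makedirs_no_stat is hypothetical and not part of the os module, and the sketch assumes POSIX-style paths (no Windows drive letters).

```python
import errno
import os

def makedirs_no_stat(path, mode=0o777):
    """Create path and any missing parents using only mkdir calls.

    Hypothetical helper, not part of the standard library. It never
    calls stat; an existing component is detected by mkdir failing
    with EEXIST. Caveat: EEXIST is also ignored when the final
    component exists as a regular file.
    """
    path = os.path.normpath(path)
    # Start from the filesystem root for absolute paths, else from ".".
    current = os.sep if os.path.isabs(path) else ""
    for part in path.split(os.sep):
        if not part:
            continue  # skip the empty string produced by a leading separator
        current = os.path.join(current, part)
        try:
            os.mkdir(current, mode)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise
```

The trade-off is one failed mkdir per component that already exists, instead of one stat per component that does not, which should suit the case of building a tree inside an empty directory.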

Using strace, it can be seen that both implementations call mkdir the same number of times. However, when the path is relative, the function you created constructs the absolute path anyway, whereas os.makedirs uses the relative path as given.

One possibility is that the extra time is the OS resolving the full absolute path on every call to find the right directory, instead of resolving a short path relative to ".".
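One way to check this hypothesis on a given machine is to time the same chain of mkdir calls once with relative paths and once with absolute paths. The following is only a sketch: the helper names time_mkdir_chain and compare_relative_and_absolute are made up for illustration, and the default depth of 50 is an arbitrary assumption.

```python
import os
import tempfile
import time

def time_mkdir_chain(base, depth):
    """Time `depth` nested os.mkdir calls whose paths start at `base`."""
    path = base
    start = time.perf_counter()
    for i in range(depth):
        path = os.path.join(path, "d%d" % i)
        os.mkdir(path)
    return time.perf_counter() - start

def compare_relative_and_absolute(depth=50):
    """Return (relative_seconds, absolute_seconds) for equal mkdir chains."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            os.mkdir("rel")
            os.mkdir("abs")
            relative = time_mkdir_chain("rel", depth)
            absolute = time_mkdir_chain(os.path.join(tmp, "abs"), depth)
        finally:
            # Leave the temporary directory before it is removed.
            os.chdir(old_cwd)
    return relative, absolute
```

A single run is noisy, so repeating the comparison several times and looking at the spread would be more informative than any one pair of numbers.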

os.makedirs

stat("a/b/c", 0x7fff34b1c4d0)           = -1 ENOENT (No such file or directory)
stat("a/b", 0x7fff34b1c260)             = -1 ENOENT (No such file or directory)
stat("a", 0x7fff34b1bff0)               = -1 ENOENT (No such file or directory)
mkdir("a", 0777)                        = 0
mkdir("a/b", 0777)                      = 0
mkdir("a/b/c", 0777)                    = 0
mkdir("a/b/c/d", 0777)                  = 0

modified makedirs

mkdir("/tmp/a", 0777)                   = 0
mkdir("/tmp/a/b", 0777)                 = 0
mkdir("/tmp/a/b/c", 0777)               = 0
mkdir("/tmp/a/b/c/d", 0777)             = 0

That being said, I could not reproduce your results. Using cProfile, I found that the time spent in mkdir is about the same whether it is invoked by os.makedirs or by your code:

os.makedirs

 4003    0.132    0.000    0.132    0.000 {posix.mkdir}

modified makedirs

 4003    0.147    0.000    0.147    0.000 {posix.mkdir}

However, a large amount of time in the new code was spent in posixpath:

 4000    0.104    0.000    1.003    0.000 posixpath.py:400(relpath)

Perhaps the difference you saw is an artifact of the profiling method or a subtlety of your installation.
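To repeat the comparison, a sketch along these lines pulls just the mkdir rows out of a cProfile run. The helper name mkdir_profile is hypothetical, and matching on the substring "mkdir" in the profiler's function name is an assumption about how the built-in is labelled (for example {posix.mkdir} in the output above).

```python
import cProfile
import os
import pstats

def mkdir_profile(func, *args):
    """Profile func(*args); return (call count, total seconds) for mkdir."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    calls = 0
    seconds = 0.0
    # stats.stats maps (file, line, name) -> (cc, ncalls, tottime, cumtime, callers)
    for (_file, _line, name), (_cc, ncalls, tottime, _ct, _callers) in stats.stats.items():
        if "mkdir" in name:
            calls += ncalls
            seconds += tottime
    return calls, seconds
```

Calling mkdir_profile(os.makedirs, path) and then mkdir_profile(tree.makedirs, path), with tree being a DirTree instance, on fresh empty directories gives the two pairs of numbers to compare.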

Upvotes: 1
