Steffen

Reputation: 733

Working on many files with Python

The task:

I am working with 4 TB of data/files stored on an external USB disk: images, HTML, videos, executables, and so on.

I want to index all those files in a sqlite3 database with the following schema:

path TEXT, mimetype TEXT, filetype TEXT, size INT

So far:

I walk recursively through the mounted directory with os.walk, run the Linux file command via Python's subprocess, and get the size with os.path.getsize(). Finally, the results are written into the database, which is stored on my computer - the USB disk is mounted with -o ro, of course. No threading, by the way.

You can see the full code here http://hub.darcs.net/ampoffcom/smtid/browse/smtid.py
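Roughly, the approach looks like this (a minimal sketch of what is described above; the function name, table name, and exact file invocations are placeholders, the real code is behind the link):

    import os
    import sqlite3
    import subprocess

    def index_tree(root, db_path):
        """Walk root and record path, mimetype, filetype and size for every file."""
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS files "
                     "(path TEXT, mimetype TEXT, filetype TEXT, size INT)")
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # two `file` processes per file: description and MIME type
                filetype = subprocess.check_output(
                    ["file", "-b", path]).decode().strip()
                mimetype = subprocess.check_output(
                    ["file", "-b", "--mime-type", path]).decode().strip()
                size = os.path.getsize(path)
                conn.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
                             (path, mimetype, filetype, size))
        conn.commit()
        conn.close()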

The problem:

The code is really slow. I noticed that the deeper the directory structure, the slower the code gets. I suppose os.walk might be the problem.

The questions:

  1. Is there a faster alternative to os.walk?
  2. Would threading speed things up?

Upvotes: 1

Views: 250

Answers (1)

abarnert

Reputation: 366103

Is there a faster alternative to os.walk?

Yes. In fact, multiple.

  • scandir (which will be in the stdlib in 3.5) is significantly faster than walk; see the sketch after this list.
  • The C function fts is significantly faster than scandir. I'm pretty sure there are wrappers on PyPI, although I don't know one off-hand to recommend, and it's not that hard to use via ctypes or cffi if you know any C.
  • The find tool uses fts, and you can always subprocess to it if you can't use fts directly.
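As a rough illustration of the first option, a recursive walker built on scandir might look like this (a sketch assuming Python 3.5+, where it lives in the stdlib as os.scandir; on older versions the scandir package from PyPI provides the same API):

    import os

    def walk_files(root):
        """Yield (path, size) for every regular file under root.

        os.scandir returns DirEntry objects that can often answer
        is_dir()/is_file() from the directory listing itself, so far fewer
        extra stat() calls are needed than with os.walk-based code.
        """
        for entry in os.scandir(root):
            if entry.is_dir(follow_symlinks=False):
                yield from walk_files(entry.path)
            elif entry.is_file(follow_symlinks=False):
                yield entry.path, entry.stat(follow_symlinks=False).st_size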

Would threading speed things up?

That depends on details of your system that we don't have, but… You're spending all of your time waiting on the filesystem. Unless you have multiple independent drives that are only bound together at user-level (that is, not LVM or something below it like RAID) or not at all (e.g., one is just mounted under the other's filesystem), issuing multiple requests in parallel will probably not speed things up.

Still, this is pretty easy to test; why not try it and see?
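If you do want to test it, the simplest experiment is to hand the per-file work to a small thread pool and compare wall-clock time against the sequential run (a sketch; process_one is a hypothetical stand-in for your own per-file code):

    from concurrent.futures import ThreadPoolExecutor

    def process_one(path):
        # hypothetical: run `file`, stat the file, return a (path, mime, type, size) row
        ...

    def index_parallel(paths, workers=4):
        """Run the per-file work in a thread pool and collect the rows."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(process_one, paths))

Time both versions on the same small directory tree and the answer will be obvious.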


One more idea: you may be spending a lot of time spawning those file processes and communicating with them. There are multiple Python libraries that wrap the same libmagic library that file uses. I don't want to recommend one in particular over the others, so a quick search for Python libmagic bindings will turn up several options.
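For example, with the python-magic package (one of several libmagic bindings, named here only for illustration; the path is a placeholder), the per-file subprocess calls become in-process library calls:

    import magic

    # no process spawn per file; libmagic is loaded once into the interpreter
    mimetype = magic.from_file("/mnt/usb/some/file", mime=True)  # e.g. "image/png"
    filetype = magic.from_file("/mnt/usb/some/file")             # e.g. "PNG image data, ..."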


As monkut suggests, make sure you're doing bulk commits rather than autocommitting each insert in sqlite. As the SQLite FAQ explains, sqlite can do ~50000 inserts per second, but only a few dozen transactions per second.
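In practice that means batching rows into a single transaction, for example with executemany (a sketch; the files table name comes from the hypothetical sketch above, not your actual schema):

    import sqlite3

    def write_rows(db_path, rows):
        """Insert all rows in one transaction instead of committing per insert."""
        conn = sqlite3.connect(db_path)
        with conn:  # one transaction: commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO files (path, mimetype, filetype, size) VALUES (?, ?, ?, ?)",
                rows)
        conn.close()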

While we're at it, if you can put the sqlite file on a different filesystem than the one you're scanning (or keep it in memory until you're done, then write it to disk all at once), that might be worth trying.
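The in-memory variant can look like this (a sketch assuming Python 3.7+, where sqlite3.Connection.backup is available; the index.db filename is arbitrary):

    import sqlite3

    mem = sqlite3.connect(":memory:")
    # ... create the schema and do all of the inserts against `mem` ...

    # when everything is indexed, copy the whole database to disk in one pass
    disk = sqlite3.connect("index.db")
    mem.backup(disk)
    disk.close()
    mem.close()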


Finally, but most importantly:

  • Profile your code to see where the hotspots are, instead of guessing (a cProfile sketch follows below).
  • Create small data sets and benchmark different alternatives to see how much benefit you get.
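For the profiling step, cProfile and pstats from the stdlib are enough to get started (a sketch; index_tree is the hypothetical function from the first sketch, and both paths are placeholders):

    import cProfile
    import pstats

    # profile one indexing run and show the 20 most expensive functions
    cProfile.run("index_tree('/mnt/usb', 'index.db')", "index.prof")
    pstats.Stats("index.prof").sort_stats("cumulative").print_stats(20)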

Upvotes: 6
