Reputation: 733
The task:
I am working with 4 TB of data/files, stored on an external USB disk: images, html, videos, executables and so on.
I want to index all those files in a sqlite3 database with the following schema:
path TEXT, mimetype TEXT, filetype TEXT, size INT
So far:
I walk recursively through the mounted directory with os.walk, run the Linux file command via Python's subprocess, and get the size with os.path.getsize(). Finally, the results are written into the database, which is stored on my computer - the USB disk is mounted with -o ro, of course. No threading, by the way.
You can see the full code here http://hub.darcs.net/ampoffcom/smtid/browse/smtid.py
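Roughly, the approach described above looks like this (a sketch for illustration, not the actual smtid.py; table names and the `unknown` fallback are my own):

```python
import os
import sqlite3
import subprocess

def index_tree(root, db_path):
    """Walk root and record path, mimetype, filetype, and size
    for every file (a sketch of the approach described above)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS files "
                 "(path TEXT, mimetype TEXT, filetype TEXT, size INT)")
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # One `file` process per file -- this is the expensive part.
                mimetype = subprocess.check_output(
                    ["file", "--brief", "--mime-type", path]).decode().strip()
                filetype = subprocess.check_output(
                    ["file", "--brief", path]).decode().strip()
            except (OSError, subprocess.CalledProcessError):
                mimetype = filetype = "unknown"
            conn.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
                         (path, mimetype, filetype, os.path.getsize(path)))
    conn.commit()
    conn.close()
```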
The problem:
The code is really slow. I realized that the deeper the directory structure, the slower the code. I suspect os.walk might be the problem.
The questions:
Upvotes: 1
Views: 250
Reputation: 366103
Is there a faster alternative to os.walk?
Yes. In fact, multiple.

scandir (which will be in the stdlib in 3.5) is significantly faster than walk.

fts is significantly faster than scandir. I'm pretty sure there are wrappers on PyPI, although I don't know one off-hand to recommend, and it's not that hard to use via ctypes or cffi if you know any C.

The find tool uses fts, and you can always subprocess to it if you can't use fts directly.
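As an illustration, a recursive walk on top of scandir might look something like this (a sketch; assumes os.scandir from 3.5+ or the scandir backport on PyPI, and that you only want regular files):

```python
import os

def scandir_walk(root):
    """Yield (path, size) for every regular file under root.
    Each DirEntry caches type and stat information from the
    directory listing, so this avoids many of the extra stat()
    calls an os.walk + os.path.getsize loop would make."""
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry.path, entry.stat(follow_symlinks=False).st_size
```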
Would threading fasten things up?
That depends on details of your system that we don't have, but… You're spending all of your time waiting on the filesystem. Unless you have multiple independent drives that are only bound together at user-level (that is, not LVM or something below it like RAID) or not at all (e.g., one is just mounted under the other's filesystem), issuing multiple requests in parallel will probably not speed things up.
Still, this is pretty easy to test; why not try it and see?
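For example, a quick experiment could be as small as this (a sketch; time it against a plain sequential loop on your own drive to see whether parallel requests help):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def sizes_threaded(paths, workers=8):
    """stat the files in parallel. If this is no faster than
    [os.path.getsize(p) for p in paths], threading won't help
    the rest of the pipeline either."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so results line up with paths.
        return list(pool.map(os.path.getsize, paths))
```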
One more idea: you may be spending a lot of time spawning and communicating with those file processes. There are multiple Python libraries that use the same libmagic that it does. I don't want to recommend one in particular over the others, so here are the search results.
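If pulling in a libmagic binding is too much, even the stdlib's mimetypes module avoids the subprocess entirely, at the cost of guessing from the extension only (my suggestion, not part of the original answer):

```python
import mimetypes

# Extension-based guess only -- no process spawn, but also no content
# sniffing, so unlike libmagic it is wrong whenever the extension lies.
mimetype, _encoding = mimetypes.guess_type("/mnt/usb/holiday.jpeg")  # hypothetical path
```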
As monkut suggests, make sure you're doing bulk commits, not autocommitting each insert with sqlite. As the FAQ explains, sqlite can do ~50000 inserts per second, but only a few dozen transactions per second.
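Concretely, batching means one transaction for many rows rather than one per insert (a sketch; the table layout matches the schema above, the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use your on-disk path in practice
conn.execute("CREATE TABLE IF NOT EXISTS files "
             "(path TEXT, mimetype TEXT, filetype TEXT, size INT)")

rows = [("/mnt/usb/a.png", "image/png", "PNG image data", 1024),
        ("/mnt/usb/b.html", "text/html", "HTML document", 2048)]

# One executemany + one commit = a single transaction (and fsync) for
# the whole batch, instead of an implicit transaction per insert.
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)
conn.commit()
```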
While we're at it, if you can put the sqlite file on a different filesystem than the one you're scanning (or keep it in memory until you're done, then write it to disk all at once), that might be worth trying.
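The keep-it-in-memory variant can be sketched with the sqlite3 backup API (Connection.backup needs Python 3.7+, well after this answer was written; "index.db" is a hypothetical filename):

```python
import sqlite3

# Build the whole index in memory first...
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE files (path TEXT, mimetype TEXT, filetype TEXT, size INT)")
mem.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
            ("/mnt/usb/a.png", "image/png", "PNG image data", 1024))

# ...then dump it to disk in one pass once the scan is done.
disk = sqlite3.connect("index.db")
mem.backup(disk)  # replaces the destination database's contents
```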
Finally, but most importantly:
Upvotes: 6