Reputation: 29715
I have written a small crawler to scan and re-sort directory structures.
It is based on dirent (which is a small wrapper around FindNextFileA). In my first benchmarks it is surprisingly slow:
around 123473 ms for 4500 files (ThinkPad T60p, local Samsung 320 GB 2.5" HDD). 121481 files found in 123473 milliseconds. Is this speed normal?
This is my code:
#include <windows.h>   // FILE_ATTRIBUTE_DIRECTORY
#include <dirent.h>    // dirent wrapper around FindFirstFileA/FindNextFileA
#include <string>
#include <cstring>
#include <cstdio>

// Recursively counts the plain files below strDir (strPattern is not used yet).
int testPrintDir(std::string strDir, std::string strPattern = "*", bool recurse = true) {
    struct dirent *ent;
    DIR *dir = opendir(strDir.c_str());
    int retVal = 0;
    if (dir != NULL) {
        while ((ent = readdir(dir)) != NULL) {
            // Skip the "." and ".." pseudo-entries.
            if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0) {
                std::string strFullName = strDir + "\\" + std::string(ent->d_name);
                // 'data' is the WIN32_FIND_DATA that this dirent wrapper exposes.
                bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
                std::string strType = isDir ? "DIR" : "FILE";
                if (!isDir) {
                    //printf("%s <%s>\n", strFullName.c_str(), strType.c_str());
                    retVal++;
                }
                if (isDir && recurse) {
                    retVal += testPrintDir(strFullName, strPattern, recurse);
                }
            }
        }
        closedir(dir);
        return retVal;
    } else {
        /* could not open directory */
        perror("DIR NOT FOUND!");
        return -1;
    }
}
Upvotes: 7
Views: 4155
Reputation: 106549
There are some circumstances where such a speed is normal, yes. First, using FindFirstFileA instead of FindFirstFileW incurs overhead for the conversion from UTF-16 to ANSI. Second, if you are going through directories that have not yet been accessed by the operating system, you will incur at least one seek penalty (about 16 ms on most consumer hard drives), limiting your enumeration to well under 100 directory checks per second. This will get worse if the Master File Table on the given drive is badly fragmented.
Regarding the number of files, it's going to depend more on the number of files per directory than on the total number of files.
Upvotes: 6
Reputation: 4442
Probably the drive is the bottleneck. But you can try:
- allocating the temporary buffers on the stack (alloca).
- FindFirstFileW and FindNextFileW may be a little faster.
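For what it's worth, a minimal sketch of enumerating a single directory with the wide APIs directly (no recursion, no pattern matching; countFilesW is just an illustrative name, not part of the dirent wrapper):

#include <windows.h>
#include <string>

// Counts the plain files in one directory via the wide (UTF-16) Win32 API,
// skipping the ANSI <-> UTF-16 conversions done by the *A variants.
int countFilesW(const std::wstring &dir) {
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE)
        return -1;                                  // directory could not be opened
    int count = 0;
    do {
        if ((fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) == 0)
            ++count;                                // a regular file
    } while (FindNextFileW(h, &fd));
    FindClose(h);
    return count;
}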
Upvotes: 2
Reputation: 1
You're holding DIR handles open during the recursive dive. Instead, keep a local list of the directories that you've encountered, and process them outside that loop, after the closedir().
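As a rough sketch of that idea, assuming the same dirent wrapper as the question (the 'data' member comes from it; countFilesDeferred and subDirs are names made up here), only one DIR handle is open at any time because the recursion happens after closedir():

#include <windows.h>  // FILE_ATTRIBUTE_DIRECTORY
#include <dirent.h>
#include <string>
#include <cstring>
#include <vector>

// Counts files below strDir, descending into subdirectories only after
// the current directory handle has been closed.
int countFilesDeferred(const std::string &strDir) {
    DIR *dir = opendir(strDir.c_str());
    if (dir == NULL)
        return -1;                                   // could not open directory
    int count = 0;
    std::vector<std::string> subDirs;                // directories found in this pass
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        std::string full = strDir + "\\" + ent->d_name;
        if (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            subDirs.push_back(full);                 // remember it, descend later
        else
            ++count;
    }
    closedir(dir);                                   // handle released before recursing
    for (size_t i = 0; i < subDirs.size(); ++i)
        count += countFilesDeferred(subDirs[i]);
    return count;
}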
Upvotes: 0
Reputation: 54614
The very first time you ever do a recursive directory crawl, you should probably enumerate the entire current directory first and queue up any directories you find to visit when you are done. That way, you are likely to take advantage of any immediate read-ahead optimizations NTFS might do.
On subsequent recursive directory crawls, if the metadata for the directories fits in the system cache, it doesn't matter how you do it.
EDIT: clarified how you should visit directories; it isn't technically a breadth-first search.
Upvotes: 4