Peter Parker

Reputation: 29715

Fastest way to recursively crawl NTFS directories in C++

I have written a small crawler to scan and resort directory structures.

It is based on dirent (which is a small wrapper around FindNextFileA). In my first benchmarks it is surprisingly slow:

around 123473 ms for 4500 files (ThinkPad T60p, local Samsung 320 GB 2.5" HD). 121481 files were found in 123473 milliseconds. Is this speed normal?

This is my code:

int testPrintDir(std::string strDir, std::string strPattern = "*", bool recurse = true) {
  struct dirent *ent;
  DIR *dir = opendir(strDir.c_str());
  int retVal = 0;
  if (dir != NULL) {
    while ((ent = readdir(dir)) != NULL) {
      if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0) {
        std::string strFullName = strDir + "\\" + std::string(ent->d_name);
        // dwFileAttributes comes from the WIN32_FIND_DATA the dirent wrapper exposes
        bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
        std::string strType = isDir ? "DIR" : "FILE";
        if (!isDir) {
          //printf("%s <%s>\n", strFullName.c_str(), strType.c_str());
          retVal++;
        }
        if (isDir && recurse) {
          retVal += testPrintDir(strFullName, strPattern, recurse);
        }
      }
    }
    closedir(dir);
    return retVal;
  } else {
    /* could not open directory */
    perror("DIR NOT FOUND!");
    return -1;
  }
}

Upvotes: 7

Views: 4155

Answers (4)

Billy ONeal

Reputation: 106549

Yes, there are circumstances where such a speed is normal. First, using FindFirstFileA instead of FindFirstFileW incurs overhead for the conversion from UTF-16 to ANSI. Second, if you are going through directories the operating system has not yet accessed, you will incur at least one seek penalty (about 16 ms on most consumer hard drives), limiting your enumeration to well under 100 directory checks per second. This gets worse if the Master File Table on the given drive is badly fragmented.

Regarding the number of files: it's going to depend more on the number of files per directory than on the total number of files.
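For illustration, here is a minimal sketch of the same count done directly against the wide-character API, so no UTF-16-to-ANSI conversion happens per entry (countFilesW is a hypothetical name, not something from the question):

#include <windows.h>
#include <string>

// Minimal sketch: enumerate with the native wide-character API so the
// kernel's UTF-16 names are used as-is. Error handling is kept minimal.
int countFilesW(const std::wstring &dir) {
  WIN32_FIND_DATAW fd;
  HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
  if (h == INVALID_HANDLE_VALUE)
    return -1;  // could not open directory
  int count = 0;
  do {
    if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
      continue;
    if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
      int sub = countFilesW(dir + L"\\" + fd.cFileName);
      if (sub > 0)
        count += sub;
    } else {
      ++count;
    }
  } while (FindNextFileW(h, &fd));
  FindClose(h);
  return count;
}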

Upvotes: 6

adf88

Reputation: 4442

Probably the drive is the bottleneck, but you can try:

  1. String operations can be optimized - use char arrays instead of std::string.
  2. Building strFullName on every recursive call is not necessary. Use a single fixed buffer of chars (e.g. a static array inside the function) and modify it in place (points 1-3 are sketched after this list).
  3. Don't pass strPattern by value!
  4. Don't create strType unless you are debugging.
  5. Others suggested building a list of directories to process before recursing deeper. To build it I suggest a single static array (as in point 2) or the stack (alloca).
  6. Does the filesystem use Unicode to store file names? If so, using Unicode strings with FindFirstFileW and FindNextFileW may be a little faster.
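A sketch of points 1-3, assuming the same dirent wrapper as in the question (testPrintDir2 and the caller-provided MAX_PATH buffer are illustrative assumptions, not code from the question):

#include <windows.h>
#include <cstring>
#include <string>
#include <dirent.h>  // the same wrapper used in the question (assumed)

// Sketch of points 1-3: the pattern travels by const reference and one
// caller-provided char buffer is extended in place instead of building a
// new std::string per call. Assumes paths fit in the caller's buffer.
int testPrintDir2(char *pathBuf, size_t dirLen, const std::string &strPattern, bool recurse = true) {
  DIR *dir = opendir(pathBuf);
  if (dir == NULL)
    return -1;
  int retVal = 0;
  struct dirent *ent;
  while ((ent = readdir(dir)) != NULL) {
    if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
      continue;
    // append "\name" in place, remembering where the parent path ended
    pathBuf[dirLen] = '\\';
    strcpy(pathBuf + dirLen + 1, ent->d_name);
    size_t newLen = dirLen + 1 + strlen(ent->d_name);
    bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
    if (!isDir)
      retVal++;
    else if (recurse)
      retVal += testPrintDir2(pathBuf, newLen, strPattern, recurse);
    pathBuf[dirLen] = '\0';  // truncate back to the parent path
  }
  closedir(dir);
  return retVal;
}

Called, for example, with: char buf[MAX_PATH]; strcpy(buf, "C:\\some\\dir"); testPrintDir2(buf, strlen(buf), "*");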

Upvotes: 2

John

Reputation: 1

You're holding DIR handles open during the recursive dive. Instead, keep a local list of directories that you've encountered, and process them outside that loop, after the closedir().
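A sketch of that idea, assuming the same dirent wrapper as in the question (countFiles is a hypothetical name):

#include <cstring>
#include <string>
#include <vector>
#include <dirent.h>  // the same wrapper used in the question (assumed)

// Sketch: remember subdirectories, close the handle, then recurse, so
// only one DIR handle is open at any moment.
int countFiles(const std::string &strDir) {
  DIR *dir = opendir(strDir.c_str());
  if (dir == NULL)
    return -1;
  int retVal = 0;
  std::vector<std::string> subDirs;
  struct dirent *ent;
  while ((ent = readdir(dir)) != NULL) {
    if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
      continue;
    if (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
      subDirs.push_back(strDir + "\\" + ent->d_name);
    else
      retVal++;
  }
  closedir(dir);  // release the handle before recursing
  for (size_t i = 0; i < subDirs.size(); ++i) {
    int sub = countFiles(subDirs[i]);
    if (sub > 0)
      retVal += sub;
  }
  return retVal;
}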

Upvotes: 0

MSN

Reputation: 54614

The very first time you ever do a recursive directory crawl, you should probably enumerate the entire current directory first and queue up any directories you find to visit when you are done. That way, you are likely to take advantage of any immediate read-ahead optimizations NTFS might do.

On subsequent recursive directory crawls, if the metadata for the directories fits in the system cache, it doesn't matter how you do it.

EDIT: clarifying how you should visit directories: it isn't technically a breadth-first search.
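A sketch with an explicit work list, assuming the same dirent wrapper as in the question (countFilesQueued is a hypothetical name, and the LIFO order is one possible choice; as noted above, this is not a breadth-first search):

#include <cstring>
#include <string>
#include <vector>
#include <dirent.h>  // the same wrapper used in the question (assumed)

// Sketch: each directory is enumerated completely before any of its
// children are opened, which is the access pattern described above.
int countFilesQueued(const std::string &root) {
  std::vector<std::string> work;
  work.push_back(root);
  int files = 0;
  while (!work.empty()) {
    std::string cur = work.back();
    work.pop_back();
    DIR *dir = opendir(cur.c_str());
    if (dir == NULL)
      continue;  // skip unreadable directories
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
      if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
        continue;
      if (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
        work.push_back(cur + "\\" + ent->d_name);
      else
        ++files;
    }
    closedir(dir);
  }
  return files;
}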

Upvotes: 4
