Billy ONeal
Billy ONeal

Reputation: 106559

How can I quickly enumerate directories on Win32?

I'm trying to speedup directory enumeration in C++, where I'm recursing into subdirectories. I currently have an app which spends 95% of it's time in FindFirst/FindNextFile APIs, and it takes several minutes to enumerate all the files on a given volume. I know it's possible to do this faster because there is an app that does: Everything. It enumerates my entire drive in seconds.

How might I accomplish something like this?

Upvotes: 29

Views: 9471

Answers (7)

Michael Haephrati
Michael Haephrati

Reputation: 4225

I checked 'everything' and you can use its SDK and get similar results from your own code. https://www.voidtools.com/support/everything/sdk/c/

I also tried to write a relatively fast solution. In any case, I wanted to find the most recent file with a name that matches a given creteria.

#include <windows.h>
#include <string>
#include <vector>
#include <queue>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <iostream>
#include <functional>

#include <shlwapi.h> // Required for PathMatchSpec
#pragma comment(lib, "Shlwapi.lib") // Required to link Shlwapi.lib
// A structure to store the file path and its last write time
struct FileEntry
{
    std::wstring filePath;
    FILETIME lastWriteTime;

    bool operator<(const FileEntry& other) const
    {
        return CompareFileTime(&lastWriteTime, &other.lastWriteTime) < 0;
    }
};

// Thread pool to manage concurrent directory searches
class ThreadPool
{
public:
    ThreadPool(size_t numThreads);
    ~ThreadPool();

    void enqueue(std::function<void()> task);
    void waitForCompletion();

private:
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queueMutex;
    std::condition_variable condition;
    bool stop;

    size_t activeTasks;
    std::condition_variable allTasksCompleted;
};

ThreadPool::ThreadPool(size_t numThreads) : stop(false), activeTasks(0)
{
    for (size_t i = 0; i < numThreads; ++i)
    {
        workers.emplace_back([this]
            {
                while (true)
                {
                    std::function<void()> task;

                    {
                        std::unique_lock<std::mutex> lock(this->queueMutex);
                        this->condition.wait(lock, [this] { return this->stop || !this->tasks.empty(); });

                        if (this->stop && this->tasks.empty())
                            return;

                        task = std::move(this->tasks.front());
                        this->tasks.pop();
                        ++activeTasks;
                    }

                    task();

                    {
                        std::unique_lock<std::mutex> lock(this->queueMutex);
                        --activeTasks;
                        if (activeTasks == 0 && tasks.empty())
                        {
                            allTasksCompleted.notify_all();
                        }
                    }
                }
            });
    }
}

ThreadPool::~ThreadPool()
{
    {
        std::unique_lock<std::mutex> lock(queueMutex);
        stop = true;
    }
    condition.notify_all();
    for (std::thread& worker : workers)
        worker.join();
}

void ThreadPool::enqueue(std::function<void()> task)
{
    {
        std::unique_lock<std::mutex> lock(queueMutex);
        tasks.emplace(task);
    }
    condition.notify_one();
}

void ThreadPool::waitForCompletion()
{
    std::unique_lock<std::mutex> lock(queueMutex);
    allTasksCompleted.wait(lock, [this] { return tasks.empty() && activeTasks == 0; });
}

std::priority_queue<FileEntry> fileQueue;
std::mutex resultMutex;

// Function to search for files (based on a wildcard pattern) in a given directory
void SearchDirectory(const std::wstring& directory, const std::wstring& filePattern, ThreadPool& pool)
{
    WIN32_FIND_DATA findFileData;
    HANDLE hFind = INVALID_HANDLE_VALUE;
    std::wstring searchPath = directory + L"\\*";

    hFind = FindFirstFile(searchPath.c_str(), &findFileData);
    if (hFind == INVALID_HANDLE_VALUE)
    {
        return;
    }

    do
    {
        std::wstring fileOrDir = findFileData.cFileName;

        if (fileOrDir == L"." || fileOrDir == L"..")
        {
            continue;
        }

        std::wstring fullPath = directory + L"\\" + fileOrDir;

        if (findFileData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
        {
            // Recursively search in subdirectories
            pool.enqueue([=, &pool] { SearchDirectory(fullPath, filePattern, pool); });
        }
        else
        {
            if (PathMatchSpec(fileOrDir.c_str(), filePattern.c_str()))
            {
                std::lock_guard<std::mutex> lock(resultMutex);
                fileQueue.push({ fullPath, findFileData.ftLastWriteTime });
            }
        }

    }
    while (FindNextFile(hFind, &findFileData) != 0);

    FindClose(hFind);
}

// Function to find the most recent file (based on wildcard creteria) using a thread pool
std::wstring FindMostRecentFile(const std::wstring& directory, const std::wstring& filePattern)
{
    ThreadPool pool(std::thread::hardware_concurrency());

    // Start the search in the initial directory
    pool.enqueue([&pool, &directory, &filePattern] { SearchDirectory(directory, filePattern, pool); });

    pool.waitForCompletion();

    std::lock_guard<std::mutex> lock(resultMutex);
    if (!fileQueue.empty())
    {
        return fileQueue.top().filePath;
    }
    else
    {
        return L"No matching file found.";
    }
}

int main()
{
    std::wstring directory = L"C:\\";
    std::wstring filePattern = L"*business*.pdf";

    std::wstring mostRecentFile = FindMostRecentFile(directory, filePattern);
    std::wcout << L"Most recent file: " << mostRecentFile << std::endl;

    return 0;
}

Upvotes: 2

Michael Burr
Michael Burr

Reputation: 340218

"Everything" accesses directory information at a lower level than the Win32 FindFirst/FindNext APIs.

I believe it reads and interprets the NTFS MFT structures directly, and that this is one of the main reasons for its performance. It's also why it requires admin privileges and why "Everything" only indexes local or removable NTFS volumes (not network drives, for example).

A couple other utilities that do the similar things are:

A little reverse engineering with a debugger on these tools might give you some insight on the techniques they use.

Upvotes: 6

S. Liu
S. Liu

Reputation: 1078

If you are doing this on NTFS, here's a lib for low level access: NTFSLib.

You can enumerate through all file records in $MFT, each representing a real file on disk. You can get all file attributes from the record, including $DATA.

This may be the fastest way to enumerate all files/directories on NTFS volumes, 200k~300k files per minute as I tested.

Upvotes: 1

Scott Dillman
Scott Dillman

Reputation: 831

I realize this is an old post, but there is a project on source forge that does exactly what you are asking and the source code is available.

You can find the project here: NTFS-Search

Upvotes: 8

peterchen
peterchen

Reputation: 41106

"Everything" builds an index in the background, so queries are against the index not the file system itself.

There are a few improvements to be made - at least over the straight-forward algorrithm:

First, breadth search over depth search. That is, enumerate and process all files in a single folder before recursing into the sub folders you found. This improves locality - usually a lot.

On Windows 7 / W2K8R2, you can use FindFirstFileEx with FindExInfoBasic, the main speedup being omitting the short file name on NTFS file systems where this is enabled.

Separate threads help if you enumerate different physical disks (not just drives). For the same disk it only helps if it's an SSD ("zero seek time"), or you spend significant time processing a file name (compared to the time spent on disk access).


[edit] Wikipedia actually has some comments - Basically, they are skipping the file system abstraction layer, and access NTFS directly. This way, they can batch calls and skip expensive services of the file system - such as checking ACL's.

A good starting point would be the NTFS Technical Reference on MSDN.

Upvotes: 7

Mark Ransom
Mark Ransom

Reputation: 308206

If you're already doing the best you can to get the maximum speed from the API, the next step is to do low-level disk accesses and bypass Windows altogether. You might get some guidance from the NTFS drivers for Linux, or perhaps you can use one directly.

Upvotes: 2

Mark Ransom
Mark Ransom

Reputation: 308206

Don't recurse immediately, save a list of directories you find and dive into them when finished. You want to do linear access to each directory, to take advantage of locality of reference and any caching the OS is doing.

Upvotes: 5

Related Questions