Ash

Reputation: 62096

Efficiently retrieving and filtering files

This earlier SO question talks about how to retrieve all files in a directory tree that match one of multiple extensions.

E.g. retrieve all files within C:\ and all its subdirectories, matching *.log, *.txt, or *.dat.

The accepted answer was this:

var files = Directory.GetFiles("C:\\path", "*.*", SearchOption.AllDirectories)
            .Where(s => s.EndsWith(".mp3") || s.EndsWith(".jpg"));

This strikes me as quite inefficient. If you search a directory tree containing thousands of files (it uses SearchOption.AllDirectories), every file path in the tree is loaded into an in-memory array first, and only then are the mismatches removed. (Reminds me of the "paging" offered by ASP.NET datagrids.)

Unfortunately the standard System.IO.DirectoryInfo.GetFiles method only accepts one filter at a time.

It could just be my lack of LINQ knowledge: is it actually as inefficient as I suggest?

Secondly, is there a more efficient way to do it, both with and without LINQ, without resorting to multiple calls to GetFiles?
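
For reference, the multiple-call workaround I want to avoid would look something like this (each pattern forces a separate scan of the entire tree):

var patterns = new[] { "*.log", "*.txt", "*.dat" };

var files = patterns.SelectMany(
    p => Directory.GetFiles("C:\\path", p, SearchOption.AllDirectories));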

Upvotes: 3

Views: 3112

Answers (4)

Markus Olsson

Reputation: 22580

I had the same problem and found the solution in Matthew Podwysocki's excellent post at codebetter.com.

He implemented a solution using native methods that lets you pass a predicate into his GetFiles implementation. Additionally, he implemented it using yield statements, reducing the memory utilization per file to an absolute minimum.

With his code you can write something like the following:

var allowedExtensions = new HashSet<string> { ".jpg", ".mp3" };

var files = GetFiles(
    "C:\\path", 
    SearchOption.AllDirectories, 
    fn => allowedExtensions.Contains(Path.GetExtension(fn))
);

The files variable will then hold an enumerable that yields the matching files lazily (deferred execution).
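
For anyone who can't reach the post, a minimal managed sketch of such a predicate-taking GetFiles might look like the following. (His actual implementation uses native methods rather than Directory.GetFiles, as noted above, but the overall shape is similar.)

static IEnumerable<string> GetFiles(
    string path, SearchOption option, Func<string, bool> predicate)
{
    // Yield matches from the current directory first...
    foreach (string file in Directory.GetFiles(path))
    {
        if (predicate(file))
            yield return file;
    }

    // ...then recurse into subdirectories if requested.
    if (option == SearchOption.AllDirectories)
    {
        foreach (string dir in Directory.GetDirectories(path))
        {
            foreach (string file in GetFiles(dir, option, predicate))
                yield return file;
        }
    }
}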

Upvotes: 2

Leandro López

Reputation: 2185

What about creating your own directory traversal function and using the C# yield operator?

EDIT: I've put together a simple test; I don't know if it's exactly what you need.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static string PATH = "F:\\users\\llopez\\media\\photos";

    static Func<string, bool> WHERE = s => s.EndsWith(".CR2") || s.EndsWith(".html");

    static void Main(string[] args)
    {
        // Eager version: GetFiles builds the full array before filtering.
        using (new Profiler())
        {
            var accepted = Directory.GetFiles(PATH, "*.*", SearchOption.AllDirectories)
                .Where(WHERE);

            foreach (string f in accepted) { }
        }

        // Lazy version: names are yielded one directory at a time.
        using (new Profiler())
        {
            var files = traverse(PATH, WHERE);

            foreach (string f in files) { }
        }

        Console.ReadLine();
    }

    static IEnumerable<string> traverse(string path, Func<string, bool> filter)
    {
        foreach (string f in Directory.GetFiles(path).Where(filter))
        {
            yield return f;
        }

        foreach (string d in Directory.GetDirectories(path))
        {
            foreach (string f in traverse(d, filter))
            {
                yield return f;
            }
        }
    }
}

class Profiler : IDisposable
{
    private Stopwatch stopwatch;

    public Profiler()
    {
        this.stopwatch = new Stopwatch();
        this.stopwatch.Start();
    }

    public void Dispose()
    {
        stopwatch.Stop();
        Console.WriteLine("Running time: {0}ms", this.stopwatch.ElapsedMilliseconds);
        Console.WriteLine("GC.GetTotalMemory(false): {0}", GC.GetTotalMemory(false));
    }
}

I know that you can't rely too much on GC.GetTotalMemory for memory profiling, but all my test runs show slightly lower memory consumption (around 100K less).

Running time: 605ms
GC.GetTotalMemory(false): 3444684
Running time: 577ms
GC.GetTotalMemory(false): 3293368

Upvotes: 1

Rune Grimstad

Reputation: 36310

The GetFiles method only reads the file names, not the file contents, so while reading all the names may be wasteful, I don't think it is anything to worry about.

The only alternative, as far as I know, would be to make multiple GetFiles calls and add the results to a collection, but that gets clumsy and requires you to scan the folder tree several times, so I suspect it would be slower too.
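
For illustration, the collection-based approach I mean might look something like this (path and patterns are just placeholders):

var results = new List<string>();

foreach (string pattern in new[] { "*.log", "*.txt", "*.dat" })
{
    // Each pattern triggers its own full scan of the directory tree.
    results.AddRange(Directory.GetFiles("C:\\path", pattern, SearchOption.AllDirectories));
}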

Upvotes: 1

Konrad Rudolph

Reputation: 545598

You are right about the memory consumption. However, I think that's a fairly premature optimization. Loading an array of a few thousand strings is no problem at all, neither for performance nor for memory consumption. Reading a directory containing that many files, however, is a problem: no matter how you store or filter the file names, it will always be relatively slow.

Upvotes: 1
