Reputation: 1584
I am trying to implement a general-purpose file system crawler that would, for example, be able to enumerate all sub-folders starting at a given root. I would like to do that using the async/await/Task paradigm.
Below is my code so far. It works, but I suspect it can be improved. In particular, the annotated Task.WaitAll causes unnecessary waits in a deep directory tree, since the loop is suspended waiting at each tree level instead of immediately proceeding to tackle the new folders being added to the folderQueue.
Somehow I would like to include the new folders being added to the folderQueue in the list of tasks being waited on by Task.WaitAll() while the WaitAll is in progress. Is that even possible?
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class FileSystemCrawlerSO
{
    static void Main(string[] args)
    {
        FileSystemCrawlerSO crawler = new FileSystemCrawlerSO();
        Stopwatch watch = new Stopwatch();
        watch.Start();
        crawler.CollectFolders(@"d:\www");
        watch.Stop();
        Console.WriteLine($"Collected {crawler.NumFolders:N0} folders in {watch.ElapsedMilliseconds} milliseconds.");
        if (Debugger.IsAttached)
            Console.ReadKey();
    }

    public int NumFolders { get; set; }

    private readonly Queue<DirectoryInfo> folderQueue;

    public FileSystemCrawlerSO()
    {
        folderQueue = new Queue<DirectoryInfo>();
    }

    public void CollectFolders(string path)
    {
        DirectoryInfo directoryInfo = new DirectoryInfo(path);
        lock (folderQueue)
            folderQueue.Enqueue(directoryInfo);

        List<Task> tasks = new List<Task>();
        do
        {
            tasks.Clear();
            lock (folderQueue)
            {
                while (folderQueue.Any())
                {
                    var folder = folderQueue.Dequeue();
                    Task task = Task.Run(() => CrawlFolder(folder));
                    tasks.Add(task);
                }
            }
            if (tasks.Any())
            {
                Console.WriteLine($"Waiting for {tasks.Count} tasks...");
                Task.WaitAll(tasks.ToArray()); //<== NOTE: THIS IS NOT OPTIMAL
            }
        } while (tasks.Any());
    }

    private void CrawlFolder(DirectoryInfo dir)
    {
        try
        {
            DirectoryInfo[] directoryInfos = dir.GetDirectories();
            lock (folderQueue)
                foreach (DirectoryInfo childInfo in directoryInfos)
                    folderQueue.Enqueue(childInfo);
            // Do something with the current folder
            // e.g. Console.WriteLine($"{dir.FullName}");
            NumFolders++;
        }
        catch (Exception ex)
        {
            while (ex != null)
            {
                Console.WriteLine($"{ex.GetType()} {ex.Message}\n{ex.StackTrace}");
                ex = ex.InnerException;
            }
        }
    }
}
Upvotes: 5
Views: 12043
Reputation: 1612
Try using the Producer-Consumer pattern. It lets you crawl directories on one thread and process them on another.
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public class Program
{
    private readonly BlockingCollection<DirectoryInfo> collection = new BlockingCollection<DirectoryInfo>();

    public void Run()
    {
        Task.Factory.StartNew(() => CollectFolders(@"d:\www"));

        foreach (var dir in collection.GetConsumingEnumerable())
        {
            // Do something with the current folder
            // e.g. Console.WriteLine($"{dir.FullName}");
        }
    }

    public void CollectFolders(string path)
    {
        try
        {
            foreach (var dir in new DirectoryInfo(path).EnumerateDirectories("*", SearchOption.AllDirectories))
            {
                collection.Add(dir);
            }
        }
        finally
        {
            collection.CompleteAdding();
        }
    }
}
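A minimal entry point to start the crawl could look like the following sketch (it only calls the Run method shown above; the class and path names are taken from this answer):

```csharp
using System;

class EntryPoint
{
    static void Main()
    {
        // Run() blocks until CollectFolders finishes and
        // every directory has been consumed.
        new Program().Run();
        Console.WriteLine("Crawl complete.");
    }
}
```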
If processing is slower than crawling, you may want to use Parallel.ForEach.
Parallel.ForEach(collection.GetConsumingEnumerable(), dir =>
{
    // Do something with the current folder
    // e.g. Console.WriteLine($"{dir.FullName}");
});
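If you need to cap the number of concurrent consumers (for example, to limit pressure on a spinning disk), Parallel.ForEach also accepts a ParallelOptions argument; a sketch, where the limit of 4 is just an assumed value:

```csharp
using System.Threading.Tasks;

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // assumed cap

Parallel.ForEach(collection.GetConsumingEnumerable(), options, dir =>
{
    // Do something with the current folder
    // e.g. Console.WriteLine($"{dir.FullName}");
});
```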
Upvotes: 9
Reputation: 456977
In theory, async/await should be able to help here. In practice, not so much. This is because Win32 does not expose asynchronous APIs for directory functions (or certain file functions, such as opening a file).
Furthermore, parallelizing the disk access by using multiple threads (Task.Run) has a tendency to backfire, particularly with traditional (non-SSD) disks. Parallel file system access (as opposed to serial file system access) tends to cause disk thrashing, reducing overall throughput.
So, in the general case, I'd recommend just using the blocking directory enumeration methods. For example:
using System;
using System.Diagnostics;
using System.IO;

class FileSystemCrawlerSO
{
    static void Main(string[] args)
    {
        var numFolders = 0;
        Stopwatch watch = new Stopwatch();
        watch.Start();
        foreach (var dir in Directory.EnumerateDirectories(@"d:\www", "*", SearchOption.AllDirectories))
        {
            // Do something with the current folder
            // e.g. Console.WriteLine($"{dir.FullName}");
            ++numFolders;
        }
        watch.Stop();
        Console.WriteLine($"Collected {numFolders:N0} folders in {watch.ElapsedMilliseconds} milliseconds.");
        if (Debugger.IsAttached)
            Console.ReadKey();
    }
}
One nice side effect of doing it the simple way is that there is no longer a race condition on the folder counter variable (NumFolders).
For a Console app, that's all you need to do. If this is going to be put into a UI app and you don't want to block the UI thread, then a single Task.Run should suffice.
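For the UI case, that single Task.Run could be sketched like this (ScanButton_Click and resultLabel are hypothetical names in a WinForms-style event handler, not part of the answer above):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

private async void ScanButton_Click(object sender, EventArgs e)
{
    // Off-load the blocking enumeration to a thread-pool thread
    // so the UI thread stays responsive.
    var numFolders = await Task.Run(() =>
        Directory.EnumerateDirectories(@"d:\www", "*", SearchOption.AllDirectories).Count());

    // Back on the UI thread here, thanks to await.
    resultLabel.Text = $"Collected {numFolders:N0} folders.";
}
```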
Upvotes: 17
Reputation: 43906
This is my suggestion. I use the generic Concurrent*<> classes, so I don't have to take care of the locks myself (though this does not automatically improve performance). I start a task for every folder and add it to a ConcurrentBag<Task>. After starting the first task, I repeatedly take a task from the bag and wait for it, and I am finished when there are no more tasks left to await.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public class FileSystemCrawlerSO
{
    public int NumFolders { get; set; }

    private readonly ConcurrentQueue<DirectoryInfo> folderQueue = new ConcurrentQueue<DirectoryInfo>();
    private readonly ConcurrentBag<Task> tasks = new ConcurrentBag<Task>();

    public void CollectFolders(string path)
    {
        DirectoryInfo directoryInfo = new DirectoryInfo(path);
        tasks.Add(Task.Run(() => CrawlFolder(directoryInfo)));

        Task taskToWaitFor;
        while (tasks.TryTake(out taskToWaitFor))
            taskToWaitFor.Wait();
    }

    private void CrawlFolder(DirectoryInfo dir)
    {
        try
        {
            DirectoryInfo[] directoryInfos = dir.GetDirectories();
            foreach (DirectoryInfo childInfo in directoryInfos)
            {
                // here may be dragons using enumeration variable as closure!!
                DirectoryInfo di = childInfo;
                tasks.Add(Task.Run(() => CrawlFolder(di)));
            }
            // Do something with the current folder
            // e.g. Console.WriteLine($"{dir.FullName}");
            NumFolders++;
        }
        catch (Exception ex)
        {
            while (ex != null)
            {
                Console.WriteLine($"{ex.GetType()} {ex.Message}\n{ex.StackTrace}");
                ex = ex.InnerException;
            }
        }
    }
}
I have not measured yet whether this is faster than your solution. But I think, as Yacoub Massad commented, that the bottleneck will be the IO system itself rather than the way you organize your tasks.
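One caveat: NumFolders++ in this version runs concurrently from many tasks and is not atomic, so increments can be lost (the same race another answer points out). A possible fix, sketched here as a small hypothetical helper class, is to back the counter with a field and use Interlocked.Increment:

```csharp
using System.Threading;

public class FolderCounter
{
    private int count; // updated concurrently by the crawler tasks

    // Volatile.Read ensures a fresh value when reading after the crawl.
    public int Count => Volatile.Read(ref count);

    // Atomic read-modify-write; safe to call from any task.
    public void Increment() => Interlocked.Increment(ref count);
}
```

In CrawlFolder, the NumFolders++ line would then become a call to Increment() on a shared FolderCounter instance.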
Upvotes: 8