Reputation: 4741
I wrote a utility that searches all the fixed drives in a system for files with certain extensions. Some of the drives contain millions of folders (say, 30 million) and the files can be found at different depths (say, the 6th/7th sub-folder). Below is the function I am using:
private void ReadDirectories(string targetDirectory)
{
    IEnumerable<string> files = Directory.EnumerateFiles(targetDirectory).AsParallel();
    ConcurrentBag<string> filesBag = new ConcurrentBag<string>(files);
    Parallel.ForEach(filesBag, (file) =>
    {
        Interlocked.Increment(ref totalFileCount);
        if (extension is a text/excel/word file)
        {
            try
            {
                // Some logic here
            }
            catch (AggregateException Aex)
            {
                Log("Aggregate exception thrown. " + Aex.Message + Aex.StackTrace + Aex.InnerException);
            }
            catch (Exception ex)
            {
                Log("File read failed: " + file + ex.Message + ex.StackTrace + ex.InnerException);
                return; // Acts like 'continue' (skip to the next item) in Parallel.ForEach
            }
        }
    });
    IEnumerable<string> directories = Directory.EnumerateDirectories(targetDirectory).AsParallel();
    ConcurrentBag<string> directoryBag = new ConcurrentBag<string>(directories);
    Parallel.ForEach(directoryBag, (subDirectory) =>
    {
        try
        {
            ReadDirectories(subDirectory);
        }
        catch (AggregateException Aex)
        {
            Log("Aggregate exception thrown. " + Aex.Message + Aex.StackTrace + Aex.InnerException);
        }
        catch (UnauthorizedAccessException Uaex)
        {
            Log("Unauthorized exception: " + Uaex.Message + Uaex.StackTrace + Uaex.InnerException);
            return;
        }
        catch (AccessViolationException Aex)
        {
            Log("Access violation exception: " + Aex.Message + Aex.StackTrace + Aex.InnerException);
            return;
        }
        catch (Exception ex)
        {
            Log("Error while reading directories and files: " + ex.Message + ex.StackTrace + ex.InnerException);
            return;
        }
    });
}
The issue I am facing is that once the application starts enumerating folders, physical memory usage climbs steadily until it peaks at 99%, at which point nothing else can run on the machine. Yet my application's own memory stays at about 80-90 MB throughout its run. Why is the physical memory usage so high? Is there something wrong with the code?
Upvotes: 1
Views: 1452
Reputation: 131714
As others explained, storing so many strings will eat up a lot of memory and can't scale. Trying to enumerate folders and files in parallel won't speed up processing either.
It's faster to use Directory.EnumerateFiles or, even better, DirectoryInfo.EnumerateFiles with SearchOption.AllDirectories to enumerate all files in the current folder and its subfolders, then process the files in parallel.
A quick and dirty option would be to use a LINQ query to filter the target files and a Parallel.ForEach to process them, e.g.:
var extensions = new[] { ".docx", ".xlsx", ... };
var folder = new DirectoryInfo(targetDirectory);
var files = from file in folder.EnumerateFiles("*.*", SearchOption.AllDirectories)
            where extensions.Contains(file.Extension, StringComparer.InvariantCultureIgnoreCase)
            select file;
Parallel.ForEach(files, file => ProcessFile(file));
This will use roughly as many tasks as there are cores on the machine to process files. You can use more tasks by specifying a different MaxDegreeOfParallelism option:
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(files, options, ProcessFile);
Parallel.ForEach will pull file names from the files query as needed. It will start processing as soon as EnumerateFiles returns the first results, instead of waiting for all file names to be loaded and cached in memory.
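In case it helps, here is a minimal sketch of what the ProcessFile worker could look like, assuming the goal from the question (count every matching file, then do some work with its contents). The counting and reading logic is only an illustration, not the original author's code:
private long totalFileCount;

// Hypothetical worker for the Parallel.ForEach above (uses System.IO and System.Threading).
private void ProcessFile(FileInfo file)
{
    Interlocked.Increment(ref totalFileCount);
    try
    {
        // Placeholder for the actual per-file work ("Some logic here" in the question).
        using (var stream = file.OpenRead())
        {
            // e.g. read or index the file contents here
        }
    }
    catch (IOException ex)
    {
        Log("File read failed: " + file.FullName + " " + ex.Message);
    }
    catch (UnauthorizedAccessException ex)
    {
        Log("Access denied: " + file.FullName + " " + ex.Message);
    }
}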
Upvotes: 2
Reputation: 949
Consider your numbers: 30 million folders, each with probably a few files, leaves you with something like 100 million strings for file and directory names. And because the method is recursive, the bags are all kept alive until the recursion finishes.
So with an average file/directory name length of 100 chars, that's on the order of 10 billion characters; since .NET strings use 2 bytes per character, that works out to roughly 20 GB of RAM for the names alone.
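A quick back-of-envelope check of that estimate (the string count and average length are the assumptions above, not measured values):
// Rough memory estimate if all path strings are kept alive at once.
long stringCount  = 100_000_000; // ~30M folders plus their files (assumption)
long avgChars     = 100;         // average characters per name (assumption)
long bytesPerChar = 2;           // .NET strings are UTF-16
long overhead     = 26;          // approx. per-string object overhead on 64-bit

long totalBytes = stringCount * (avgChars * bytesPerChar + overhead);
Console.WriteLine(totalBytes / (1024.0 * 1024 * 1024) + " GB"); // ≈ 21 GB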
Upvotes: 3