Efficiently searching for text inside a directory and its subdirectories using C#

I am trying to search for a particular occurrence of a string in the files belonging to a directory (the search is also performed in the subdirectories). Currently, I have come up with a solution something like this:

  1. Get all the file names inside the directory and its subdirectories.
  2. Open the files one by one.
  3. Search for the particular string.
  4. If it is found, store the file name in an array.
  5. Continue until the last file.

    string[] fileNames = Directory.GetFiles(@"d:\test", "*.txt", SearchOption.AllDirectories);
    foreach (string sTem in fileNames)
    {
        foreach (string line in File.ReadAllLines(sTem))
        {
            if (line.Contains(SearchString))
            {
                MessageBox.Show("Found search string!");
                break;
            }
        }
    }
    

I think there may be other methods/approaches that are more efficient and faster than this. Using a batch file? Another solution is to use findstr (but how do I use it directly from a C# program without a batch file?). What is the most efficient approach (or at least more efficient than what I did)? Code examples are much appreciated!

I found another solution:

Process myproc = new Process();
myproc.StartInfo.FileName = "findstr";
// /m : print only the names of files that contain a match
// /s : search subdirectories
// /d:dirlist : search the given (semicolon-delimited) directories
myproc.StartInfo.Arguments = "/m /s /d:\"c:\\REQs\" \"madhuresh\" *.req";
myproc.StartInfo.RedirectStandardOutput = true;
myproc.StartInfo.UseShellExecute = false;

myproc.Start();
string output = myproc.StandardOutput.ReadToEnd();
myproc.WaitForExit();

Is this way of executing a process good? Comments on this are welcome too!
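For completeness (this parsing step is my addition, not part of the snippet above): with /m, findstr prints one matching file path per line, so the captured output can be split into file names like this:

// Sketch: split findstr's /m output (one file path per line)
// into the list of matching file names, reusing `output` from above.
string[] matchingFiles = output.Split(
    new[] { '\r', '\n' },
    StringSplitOptions.RemoveEmptyEntries);

foreach (string fileName in matchingFiles)
    MessageBox.Show(fileName);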

Based on @abatishchev's method, here is a sleeker version (I don't know whether it's more efficient!). It searches the directory as well as the subdirectories:

IEnumerable<string> s = from file in Directory.EnumerateFiles("c:\\directorypath", "*.req", SearchOption.AllDirectories)
                        from str in File.ReadLines(file)
                        //where str.Contains("Text@tosearched2")
                        where str.IndexOf(sSearchItem, StringComparison.OrdinalIgnoreCase) >= 0
                        select file;

foreach (string sa in s)
    MessageBox.Show(sa);

(Using IndexOf with StringComparison.OrdinalIgnoreCase gives a case-insensitive search; maybe that helps someone.) Please comment! Thanks.
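One thing to note: because the query selects file once per matching line, a file with several matches is reported several times. A sketch (my addition, same variables as above) that reports each file at most once could be:

// Sketch: test each file's lines with Any() instead of selecting per line,
// so every matching file appears exactly once.
IEnumerable<string> matchingFiles =
    Directory.EnumerateFiles("c:\\directorypath", "*.req", SearchOption.AllDirectories)
             .Where(file => File.ReadLines(file)
                                .Any(line => line.IndexOf(sSearchItem,
                                     StringComparison.OrdinalIgnoreCase) >= 0));

foreach (string file in matchingFiles)
    MessageBox.Show(file);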

Upvotes: 1

Views: 1133

Answers (4)

Patrick Herrington

Reputation: 11

This works well. I searched around 500 terms over 230 files in under .5 milliseconds. Be aware that this is very memory intensive; it loads every file into memory.

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class FindInDirectory
{
    public class Match
    {
        public string Pattern { get; set; }
        public string Directory { get; set; }
        public MatchCollection Matches { get; set; }
    }

    public static List<FindInDirectory.Match> Search(string directory, string searchPattern, List<string> patterns)
    {
        //find all file locations
        IEnumerable<string> files = System.IO.Directory.EnumerateFiles(directory, searchPattern, System.IO.SearchOption.AllDirectories);

        //load all text into memory for MULTI-PATTERN
        //this greatly increases speed, but it requires a ton of memory!
        Dictionary<string, string> contents = files.ToDictionary(f => f, f => System.IO.File.ReadAllText(f));

        List<FindInDirectory.Match> directoryMatches = new List<Match>();

        foreach (string pattern in patterns)
        {
            directoryMatches.AddRange
            (
                contents.Select(c => new Match
                {
                    Pattern = pattern,
                    Directory = c.Key,
                    Matches = Regex.Matches(c.Value, pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline)
                })
                .Where(c => c.Matches.Count > 0)//switch to > 1 when program directory is same or child of search
            );
        }

        return directoryMatches;
    }

}

USE:

    static void Main(string[] args)
    {
        List<string> patterns = new List<string>
        {
            "class",
            "foreach",
            "main",
        };
        string searchPattern = "*.cs";
        string directory = "C:\\SearchDirectory";

        DateTime start = DateTime.UtcNow;

        FindInDirectory.Search(directory, searchPattern, patterns);

        Console.WriteLine((DateTime.UtcNow - start).TotalMilliseconds);
        Console.ReadLine();
    }
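The call above discards the returned list; a small sketch (my addition) of consuming it, using the Match class defined in this answer, might be:

// Sketch: iterate the returned matches and print a summary per file/pattern.
List<FindInDirectory.Match> results = FindInDirectory.Search(directory, searchPattern, patterns);

foreach (FindInDirectory.Match result in results)
{
    // Directory holds the file path, Matches the regex hits for that pattern
    Console.WriteLine("{0}: '{1}' matched {2} time(s)",
        result.Directory, result.Pattern, result.Matches.Count);
}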

Upvotes: 1

abatishchev

Reputation: 100248

Use Directory.EnumerateFiles() and File.ReadLines() - both provide lazy loading of data:

from file in Directory.EnumerateFiles(path)
from line in File.ReadLines(file)
where line.Contains(pattern)
select new 
{
    FileName = file, // file containing matched string
    Line = line // matched string
};

or

// inside an iterator method returning, e.g., IEnumerable<Tuple<string, string>>
// (a named iterator can't yield an anonymous type, and C# doesn't allow
//  yield return inside a try block that has a catch clause)
foreach (var file in Directory.EnumerateFiles(path))
{
    List<string> matchedLines = null;
    try
    {
        // one more try here?
        matchedLines = File.ReadLines(file)
                           .Where(line => line.Contains(pattern))
                           .ToList();
    }
    catch (SecurityException)
    {
        // swallow or log
    }

    if (matchedLines == null)
    {
        continue;
    }

    foreach (var line in matchedLines)
    {
        yield return Tuple.Create(
            file, // file containing matched string
            line  // matched string
        );
    }
}
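To actually run the search in parallel, a PLINQ sketch over the same lazy enumerables (my sketch, not part of the original answer; it reuses path and pattern from above and does no exception handling) might look like this:

// Sketch: let PLINQ drive the parallelism across files.
var matches = Directory.EnumerateFiles(path)
    .AsParallel()
    .SelectMany(file => File.ReadLines(file)
                            .Where(line => line.Contains(pattern))
                            .Select(line => new { FileName = file, Line = line }));

foreach (var match in matches)
{
    Console.WriteLine("{0}: {1}", match.FileName, match.Line);
}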

Upvotes: 3

Jodrell

Reputation: 35696

How about something like this:

var found = false;
string foundFile = null;

foreach (var file in Directory.EnumerateFiles(
            "d:\\tes\\",
            "*.txt",
            SearchOption.AllDirectories))
{
    foreach (var line in File.ReadLines(file))
    {
        if (line.Contains(searchString))
        {
            found = true;
            foundFile = file;
            break;
        }
    }

    if (found)
    {
        break;
    }
}

if (found)
{
    var message = string.Format("Search string found in \"{0}\".", foundFile);
    MessageBox.Show(message);
}

This has the advantage of loading into memory only what is required, rather than first the names of all the files and then the contents of each file.


I note you are using String.Contains which

performs an ordinal (case-sensitive and culture-insensitive) comparison

This allows us to do a simple character-wise comparison.

I'd start with a little helper function:

private static bool CompareCharBuffers(
    char[] buffer,
    int headPosition,
    char[] stringChars)
{
    // null checking and length comparison omitted

    var same = true;
    var bufferPos = headPosition;
    for (var i = 0; i < stringChars.Length; i++)
    {
        if (!stringChars[i].Equals(buffer[bufferPos]))
        {
            same = false;
            break;
        }

        // advance around the ring buffer, wrapping at the full buffer length
        bufferPos = (bufferPos + 1) % buffer.Length;
    }

    return same;
}

Then I'd alter the previous algorithm to use the function like this.

var stringChars = searchString.ToCharArray();
var found = false;
string foundFile = null;


foreach (var file in Directory.EnumerateFiles(
            "d:\\tes\\",
            "*.txt",
            SearchOption.AllDirectories))
{
    using (var reader = File.OpenText(file))
    {
        var buffer = new char[stringChars.Length];
        if (reader.ReadBlock(buffer, 0, buffer.Length - 1) 
                < stringChars.Length - 1)
        {
            continue;
        }

        var head = 0;
        var nextPos = buffer.Length - 1;
        var nextChar = reader.Read();
        while (nextChar != -1)
        {
            buffer[nextPos] = (char)nextChar;

            if (CompareCharBuffers(buffer, head, stringChars))
            {
               found = true;
               foundFile = file;
               break;
            }

            head = (head + 1) % buffer.Length;
            if (head == 0)
            {
                nextPos = buffer.Length - 1;
            }
            else
            {
                nextPos = head - 1;
            } 

            nextChar = reader.Read();
        }

        if (found)
        {
            break;
        }
    }
}

if (found)
{
    var message = string.Format("Search string found in \"{0}\".", foundFile);
    MessageBox.Show(message);
}

This holds in memory only as many chars as the search string contains, using a rolling buffer across each file. (Theoretically a file could contain no new lines at all and be as big as your whole disk, which would defeat a line-based approach, or your search string could itself contain a new line.)


As further work, I'd convert the per-file part of the algorithm into a function and investigate a multi-threaded approach.

So this would be the internal function:

static bool FileContains(string file, char[] stringChars)
{
    using (var reader = File.OpenText(file))
    {
        var buffer = new char[stringChars.Length];
        if (reader.ReadBlock(buffer, 0, buffer.Length - 1) 
                < stringChars.Length - 1)
        {
            return false;
        }

        var head = 0;
        var nextPos = buffer.Length - 1;
        var nextChar = reader.Read();
        while (nextChar != -1)
        {
            buffer[nextPos] = (char)nextChar;

            if (CompareCharBuffers(buffer, head, stringChars))
            {
               return true;
            }

            head = (head + 1) % buffer.Length;
            if (head == 0)
            {
                nextPos = buffer.Length - 1;
            }
            else
            {
                nextPos = head - 1;
            } 

            nextChar = reader.Read();
        }

        return false;
    }
}

Then you could process the files in parallel like this:

var stringChars = searchString.ToCharArray();

if (Directory.EnumerateFiles(
            "d:\\tes\\",
            "*.txt",
            SearchOption.AllDirectories)
    .AsParallel()
    .Any(file => FileContains(file, stringChars)))
{
    MessageBox.Show("Found search string!");
}

Upvotes: 2

varg

Reputation: 3636

You can create a "pipeline" with Tasks.Dataflow (this .dll isn't currently part of .NET 4.5, but you can download it from here) to consume all the files and search them for specific strings. Take a look at this Reference Implementation.
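As a rough sketch of what such a pipeline could look like (my own example, assuming the System.Threading.Tasks.Dataflow package is referenced; not taken from the linked reference implementation):

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks.Dataflow;

class DataflowSearch
{
    static void Main()
    {
        string pattern = "madhuresh"; // example search term

        // Stage 1: read each file lazily and keep only its matching lines.
        var scanFile = new TransformBlock<string, string[]>(
            file => File.ReadLines(file)
                        .Where(line => line.Contains(pattern))
                        .Select(line => file + ": " + line)
                        .ToArray(),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Stage 2: report the matches.
        var report = new ActionBlock<string[]>(matches =>
        {
            foreach (var match in matches)
                Console.WriteLine(match);
        });

        scanFile.LinkTo(report, new DataflowLinkOptions { PropagateCompletion = true });

        // Feed the pipeline with all files under the directory (example path).
        foreach (var file in Directory.EnumerateFiles(@"c:\REQs", "*.req", SearchOption.AllDirectories))
            scanFile.Post(file);

        scanFile.Complete();
        report.Completion.Wait();
    }
}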

Upvotes: 0
