Reputation: 21
There is no shortage of string-search performance questions out there, yet I still cannot make heads or tails of what the best approach is.
Long story short, I have committed to moving from 4NT to PowerShell. In leaving 4NT I am going to miss FFIND, the super quick console string-searching utility that came with it. I have decided to use my rudimentary C# programming skills to try to create my own utility for use in PowerShell that is just as quick.
So far, searching for a string in hundreds of directories across a few thousand files, some of which are quite large, takes FFIND 2.4 seconds and my utility 4.4 seconds... but only after I have run mine at least once.
The first time I run them, FFIND takes about the same time, but mine takes over a minute. What is this? Loading of libraries? File indexing? Am I doing something wrong in my code? I do not mind waiting a little longer, but the difference is extreme enough that if there is a better language or approach I would rather start down that path now, before I get too invested.
Do I need to pick another language to write a string search that will be lightning fast?
I need to use this utility to search through thousands of files for strings in web code, C# code, and another proprietary language that uses text files. I also need to be able to use it to find strings in very large log files (megabytes in size).
using System;
using System.IO;

// Note: the enclosing namespace declaration is not shown in the post;
// the final closing brace below belongs to it.
class Program
{
    public static int linecounter;
    public static int filecounter;

    static void Main(string[] args)
    {
        //
        //INIT
        //
        filecounter = 0;
        linecounter = 0;
        string word;
        // Read properties from application settings.
        string filelocation = Properties.Settings.Default.FavOne;
        // Set Args from console.
        word = args[0];
        //
        //Recursive search for sub folders and files
        //
        string startDIR;
        string filename;
        startDIR = Environment.CurrentDirectory;
        //startDIR = "c:\\SearchStringTestDIR\\";
        filename = args[1];
        DirSearch(startDIR, word, filename);
        Console.WriteLine(filecounter + " " + "Files found");
        Console.WriteLine(linecounter + " " + "Lines found");
        Console.ReadKey();
    }

    static void DirSearch(string dir, string word, string filename)
    {
        string fileline;
        string ColorOne = Properties.Settings.Default.ColorOne;
        string ColorTwo = Properties.Settings.Default.ColorTwo;
        ConsoleColor valuecolorone = (ConsoleColor)Enum.Parse(typeof(ConsoleColor), ColorOne);
        ConsoleColor valuecolortwo = (ConsoleColor)Enum.Parse(typeof(ConsoleColor), ColorTwo);
        try
        {
            // Search every file in this directory that matches the file name pattern.
            foreach (string f in Directory.GetFiles(dir, filename))
            {
                StreamReader file = new StreamReader(f);
                bool t = true;      // true until the first hit in this file is printed
                int counter = 1;    // current line number within the file
                while ((fileline = file.ReadLine()) != null)
                {
                    if (fileline.Contains(word))
                    {
                        if (t)
                        {
                            t = false;
                            filecounter++;
                            Console.ForegroundColor = valuecolorone;
                            Console.WriteLine(" ");
                            Console.WriteLine(f);
                            Console.ForegroundColor = valuecolortwo;
                        }
                        linecounter++;
                        Console.WriteLine(counter.ToString() + ". " + fileline);
                    }
                    counter++;
                }
                file.Close();
                file = null;
            }
            // Recurse into each subdirectory.
            foreach (string d in Directory.GetDirectories(dir))
            {
                //Console.WriteLine(d);
                DirSearch(d, word, filename);
            }
        }
        catch (System.Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
} // closes the namespace whose opening line is not shown
Upvotes: 1
Views: 297
Reputation: 3808
If you want to speed up your code, run a performance analysis and see what is taking the most time. I can almost guarantee the longest step here will be
fileline.Contains(word)
This function is called on every line of every file. A naive search for a word in a string can take up to len(string) * len(word) comparisons.
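Before rewriting anything, it may be worth checking where the time actually goes. The following is only a rough sketch using System.Diagnostics.Stopwatch; the class name SearchTiming and the method Measure are made up for illustration and are not part of the original utility. It separates the time spent reading a file from the time spent in Contains, so the two can be compared.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class SearchTiming
{
    // Hypothetical helper: times reading and searching separately for one file.
    public static void Measure(string path, string word,
                               out TimeSpan readTime, out TimeSpan searchTime, out int hits)
    {
        Stopwatch sw = Stopwatch.StartNew();
        string[] lines = File.ReadAllLines(path);   // disk I/O and text decoding
        readTime = sw.Elapsed;

        sw.Restart();
        hits = 0;
        foreach (string line in lines)
        {
            if (line.Contains(word))                // the suspected hot spot
            {
                hits++;
            }
        }
        searchTime = sw.Elapsed;
    }
}
```

Calling Measure on a few of the larger files and printing readTime and searchTime should show whether reading or matching dominates. A real profiler gives a more complete picture; this is just the quick version.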
You could code your own Contains method that uses a faster string matching algorithm; search for "fast exact string matching". You could also try a regex and see whether that gives you a performance improvement. But I think the simplest optimization you can try is:
Don't read the file line by line. Read its whole content into one large string:
string text;
using (StreamReader streamReader = new StreamReader(filePath, Encoding.UTF8))
    text = streamReader.ReadToEnd();   // 'using' closes the reader afterwards
Then run Contains on this string.
If you need all the matches in a file, use something like Regex.Matches(string, string).
After using the regex to get all the matches for a single file, iterate over the match collection (if there are any matches). For each match, recover the line of the original file by scanning backward and forward from the match's Index property until you reach a '\n' character on each side. The text between those two newlines is the line to output.
This will be much faster, I guarantee it.
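The whole-file approach described above could look roughly like this. It is a sketch under a few assumptions: the class name WholeFileSearch and the method FindLines are hypothetical, the search word is treated as a literal (hence Regex.Escape), and a line with several hits is reported once. Line numbers are recovered by counting '\n' characters up to each match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WholeFileSearch
{
    // Hypothetical helper: returns (1-based line number, line text) for every
    // line of 'text' that contains 'word'.
    public static List<KeyValuePair<int, string>> FindLines(string text, string word)
    {
        var results = new List<KeyValuePair<int, string>>();
        var regex = new Regex(Regex.Escape(word));   // match the word literally
        int lineNumber = 1;
        int scanned = 0;          // newlines before this position are already counted
        int lastLineStart = -1;   // start index of the last reported line

        foreach (Match m in regex.Matches(text))
        {
            // Scan backward to the previous newline to find the start of the line.
            int start = m.Index == 0 ? 0 : text.LastIndexOf('\n', m.Index - 1) + 1;
            if (start == lastLineStart)
            {
                continue;         // this line was already reported
            }

            // Scan forward to the next newline to find the end of the line.
            int end = text.IndexOf('\n', m.Index);
            if (end < 0)
            {
                end = text.Length;
            }

            // Count the newlines skipped since the previous reported line.
            for (int i = scanned; i < start; i++)
            {
                if (text[i] == '\n') lineNumber++;
            }
            scanned = start;
            lastLineStart = start;

            string line = text.Substring(start, end - start).TrimEnd('\r');
            results.Add(new KeyValuePair<int, string>(lineNumber, line));
        }
        return results;
    }
}
```

The text argument would come from something like File.ReadAllText(path) or the ReadToEnd call shown earlier. Whether this actually beats the line-by-line loop on your files is something to measure rather than assume.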
If you want to go even further, some things I've noticed are:
Move the try/catch so that it no longer wraps the whole loop. Use it only exactly where you need it. I would not use it at all.
Also check whether a native image of your program has been generated with NGen (ngen.exe). NGen compiles the managed IL ahead of time into a native image and stores it in the native image cache, so the code does not have to be JIT-compiled each time the program starts. This mainly helps startup time.
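As a sketch of the try/catch point above: one option is to wrap only the directory calls that are likely to throw, and to catch only the specific exceptions expected there. The class name SafeWalk and the method ListFiles are hypothetical; the traversal uses an explicit stack instead of recursion, which is a design choice of this example and not something the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SafeWalk
{
    // Hypothetical helper: lists files matching 'pattern' under 'root',
    // skipping directories that cannot be read.
    public static List<string> ListFiles(string root, string pattern)
    {
        var found = new List<string>();
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            string[] files;
            string[] subdirs;
            try
            {
                // Only the calls that can fail for a single directory are guarded.
                files = Directory.GetFiles(dir, pattern);
                subdirs = Directory.GetDirectories(dir);
            }
            catch (UnauthorizedAccessException)
            {
                continue;   // no permission: skip this directory only
            }
            catch (IOException)
            {
                continue;   // e.g. directory removed or path too long
            }

            found.AddRange(files);
            foreach (string d in subdirs)
            {
                pending.Push(d);
            }
        }
        return found;
    }
}
```

With this shape, one unreadable directory no longer aborts the search of its siblings, which is what happens when a single catch wraps both loops.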
EDIT
Other points: Why is there a difference between first and subsequent run times? It looks like caching. The OS could have cached the directory listings, the file contents, and the program and libraries it had to load. Usually one sees a speedup after a first run. JIT compilation could also play a small part: unless a native image has been generated with NGen and stored in the native image cache, the code is compiled again at every startup, although for a program this size that cost is usually small.
In general, I find C# performance too variable for my liking. If the suggested optimizations are not satisfactory and you want more consistent performance, try another language, one that is not 'managed'. C is probably the best for your needs.
Upvotes: 1