Reputation: 597
I have multiple .txt files, each about 150MB in size. Using C#, I need to retrieve all the lines containing a string pattern from each file and then write those lines to a newly created file.
I already looked into similar questions, but none of the suggested answers gave me the fastest way of fetching results. I tried regular expressions, LINQ queries, the Contains method, and searching with byte arrays, but all of them take more than 30 minutes to read and compare the file content.
My test files don't have any specific format; they're raw data that we can't split on a delimiter and filter with DataViews. Below is the sample format of each line in such a file.
Sample.txt
LTYY;;0,0,;123456789;;;;;;;20121002 02:00;;
ptgh;;0,0,;123456789;;;;;;;20121002 02:00;;
HYTF;;0,0,;846234863;;;;;;;20121002 02:00;;
Multiple records......
My Code
using (StreamWriter SW = new StreamWriter(newFile))
{
    using (StreamReader sr = new StreamReader(sourceFilePath))
    {
        while (sr.Peek() >= 0)
        {
            if (sr.ReadLine().Contains(stringToSearch))
                SW.WriteLine(sr.ReadLine().ToString());
        }
    }
}
I want sample code that would take less than a minute to search for 123456789 in Sample.txt. Let me know if my requirement isn't clear. Thanks in advance!
Edit
I found the root cause: the files reside on a remote server, and that is what consumes most of the time when reading them. When I copied the files to my local machine, all the comparison methods completed very quickly, so this isn't an issue with the way we read or compare the content; the methods all took more or less the same time.
But how do I address this now? I can't copy all those files to my machine for comparison, and I get OutOfMemory exceptions.
Upvotes: 2
Views: 1796
Reputation: 172608
The fastest method to search is the Boyer–Moore string search algorithm, because it doesn't need to read every byte of the file; it does, however, require random access to the bytes. Alternatively, you can try the Rabin–Karp algorithm.
Or you can try something like the following code, from this answer:
public static int FindInFile(string fileName, string value)
{   // returns complement of number of characters in file if not found
    // else returns index where value found
    int index = 0;
    using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName))
    {
        if (String.IsNullOrEmpty(value))
            return 0;
        StringSearch valueSearch = new StringSearch(value);
        int readChar;
        while ((readChar = reader.Read()) >= 0)
        {
            ++index;
            if (valueSearch.Found(readChar))
                return index - value.Length;
        }
    }
    return ~index;
}
public class StringSearch
{   // Call Found one character at a time until the string is found.
    // Requires: using System.Collections.Generic;
    private readonly string value;
    private readonly List<int> indexList = new List<int>(); // positions within in-progress partial matches

    public StringSearch(string value)
    {
        this.value = value;
    }

    public bool Found(int nextChar)
    {
        for (int index = 0; index < indexList.Count; )
        {
            int valueIndex = indexList[index];
            if (value[valueIndex] == nextChar)
            {
                ++valueIndex;
                if (valueIndex == value.Length)
                {   // complete match: remove it and report success
                    indexList[index] = indexList[indexList.Count - 1];
                    indexList.RemoveAt(indexList.Count - 1);
                    return true;
                }
                else
                {   // partial match extended by one character
                    indexList[index] = valueIndex;
                    ++index;
                }
            }
            else
            {   // next char does not match; discard this partial match
                indexList[index] = indexList[indexList.Count - 1];
                indexList.RemoveAt(indexList.Count - 1);
            }
        }
        if (value[0] == nextChar)
        {   // nextChar could start a new match
            if (value.Length == 1)
                return true;
            indexList.Add(1);
        }
        return false;
    }

    public void Reset()
    {
        indexList.Clear();
    }
}
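To reuse the StringSearch class above for the asker's actual task of copying matching lines, a minimal sketch might look like this (the method name and paths are placeholders, not part of the original answer):
// Sketch: copy every line containing `pattern`, feeding StringSearch one char at a time.
static void CopyMatchingLines(string sourcePath, string destPath, string pattern)
{
    var search = new StringSearch(pattern);
    using (var reader = new System.IO.StreamReader(sourcePath))
    using (var writer = new System.IO.StreamWriter(destPath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            search.Reset();                 // partial matches must not span lines
            foreach (char c in line)
            {
                if (search.Found(c))
                {
                    writer.WriteLine(line); // first hit in the line is enough
                    break;
                }
            }
        }
    }
}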
Upvotes: 3
Reputation: 71591
150MB is 150MB. If you have one thread going through the entire 150MB, line by line (a "line" being terminated by a newline character/group or by an EOF), your process must read in and spin through all 150MB of the data (not all at once, and it doesn't have to hold all of it at the same time). A linear search through 157,286,400 characters is, very simply, going to take time, and you say you have many such files.
First thing: you're reading the line out of the stream twice. This will, in most cases, cause you to consume two lines whenever there's a match; what's written to the new file will be the line AFTER the one containing the search string, and that line is never itself checked. This is probably not what you want (then again, it may be). If you want to write the line actually containing the search string, read it into a variable before performing the Contains check.
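A sketch of that fix inside the loop:
string line = sr.ReadLine();        // read the line once
if (line.Contains(stringToSearch))
    SW.WriteLine(line);             // write the matching line itself, not the next one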
Second, String.Contains() will, by necessity, perform a linear search. In your case the behavior will actually approach N^2, because when searching for a string within a string, the first character must be found, and where it is, each character is then matched one by one to subsequent characters until all characters in the search string have matched or a non-matching character is found. When a non-match occurs, the algorithm must go back to the character after the initial match to avoid skipping a possible match, meaning it can test the same character many times when checking for a long string against a longer one with many partial matches. This strategy is therefore technically a "brute force" solution. Unfortunately, when you don't know where to look (such as in unsorted data files), there is no more efficient solution.
The only possible speedup I could suggest, other than being able to sort the files' data and then perform an indexed search, is to multithread the solution; if you're only running this method on one thread that looks through every file, not only is only one thread doing the job, but that thread is constantly waiting for the hard drive to serve up the data it needs. Having 5 or 10 threads each working through one file at a time will not only leverage the true power of modern multi-core CPUs more efficiently, but while one thread is waiting on the hard drive, another thread whose data has been loaded can execute, further increasing the efficiency of this approach. Remember, the further away the data is from the CPU, the longer it takes for the CPU to get it, and when your CPU can do between 2 and 4 billion things per second, having to wait even a few milliseconds for the hard drive means you're losing out on millions of potential instructions per second.
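A rough sketch of that idea (the directory, pattern, and output naming are placeholders; this needs System.IO, System.Linq, and System.Threading.Tasks):
// Sketch: scan several files concurrently, one file per worker at a time.
var sourceFiles = Directory.EnumerateFiles(@"d:\data", "*.txt");
Parallel.ForEach(
    sourceFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 5 }, // 5 to 10 workers, as suggested
    file =>
    {
        var matches = File.ReadLines(file)
                          .Where(line => line.Contains("123456789"));
        File.WriteAllLines(file + ".matches.txt", matches); // one output per input file
    });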
Upvotes: 1
Reputation: 14532
As I said already, you should have a database, but whatever.
The fastest, shortest, and nicest way to do it (it's even a one-liner) is this:
File.AppendAllLines("b.txt", File.ReadLines("a.txt")
.Where(x => x.Contains("123456789")));
But fast? 150MB is 150MB. It's gonna take a while.
You can replace the Contains method with your own, for faster comparison, but that's a whole different question.
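Purely as an illustration of what "your own" comparison could look like (not from the answer; note that String.Contains already performs an ordinal search, so benchmark before swapping anything in):
// Illustrative sketch: hand-rolled ordinal substring test with a first-character prefilter.
static bool ContainsFast(string line, string pattern)
{
    char first = pattern[0];
    int last = line.Length - pattern.Length;
    for (int i = 0; i <= last; i++)
    {
        if (line[i] != first)
            continue; // cheap prefilter before the full comparison
        int j = 1;
        while (j < pattern.Length && line[i + j] == pattern[j])
            j++;
        if (j == pattern.Length)
            return true;
    }
    return false;
}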
Other possible solution...
var sb = new StringBuilder();
foreach (var x in File.ReadLines("a.txt").Where(x => x.Contains("123456789")))
{
    sb.AppendLine(x);
}
File.WriteAllText("b.txt", sb.ToString()); // That is one heavy operation there...
Testing it with a 150MB file, it found all the results within 3 seconds. The thing that takes time is writing the results into the second file (in case there are many results).
Upvotes: 1
Reputation: 6517
You're going to experience performance problems in any of your approaches, because they block on input from these files while doing string comparisons.
But Windows has a pretty high-performance grep-like tool for doing string searches of text files, called FINDSTR, which might be fast enough. You could simply call it as a shell command, or redirect the results of the command to your output file.
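For instance, a sketch of shelling out to FINDSTR (the paths are placeholders; /C: makes it search for the literal string instead of a space-separated list of terms):
// Run FINDSTR through cmd.exe so the > redirection creates the output file.
var psi = new System.Diagnostics.ProcessStartInfo("cmd.exe",
    "/c findstr /C:\"123456789\" \"d:\\data\\Sample.txt\" > \"d:\\data\\matches.txt\"")
{
    UseShellExecute = false,
    CreateNoWindow = true
};
using (var p = System.Diagnostics.Process.Start(psi))
{
    p.WaitForExit(); // matches.txt now holds every line containing the pattern
}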
Either preprocessing (sorting) the files or loading your large files into a database would be faster, but I'm assuming that you already have existing files you need to search.
Upvotes: 0
Reputation: 43693
Do not read and write at the same time. Search first, save the list of matching lines, and write it to the file at the end.
using System;
using System.Collections.Generic;
using System.IO;
...
List<string> list = new List<string>();
using (StreamReader reader = new StreamReader("input.txt")) {
string line;
while ((line = reader.ReadLine()) != null) {
if (line.Contains(stringToSearch)) {
list.Add(line); // Add to list.
}
}
}
using (StreamWriter writer = new StreamWriter("output.txt")) {
foreach (string line in list) {
writer.WriteLine(line);
}
}
Upvotes: 0
Reputation:
I don't know how long this will take to run, but here are some improvements:
using (StreamWriter SW = new StreamWriter(newFile))
{
    using (StreamReader sr = new StreamReader(sourceFilePath))
    {
        while (!sr.EndOfStream)
        {
            var line = sr.ReadLine();
            if (line.Contains(stringToSearch))
                SW.WriteLine(line);
        }
    }
}
Note that you don't need Peek; EndOfStream will give you what you want. You were calling ReadLine twice (probably not what you had intended). And there's no need to call ToString() on a string.
Upvotes: 1
Reputation: 31196
I'm not giving you sample code, but have you tried sorting the content of your files?
Trying to search for a string in 150MB worth of files is going to take some time any way you slice it, and if regex takes too long for you, then I'd suggest sorting the content of your files so that you know roughly where "123456789" will occur before you actually search; that way you won't have to search the unimportant parts.
Upvotes: 0