Reputation: 11019
I wrote a WinForms application that reads each line of a text file, does a search and replace on the line using Regex, and then writes the result out to a new file. I chose the line-by-line approach because some of the files are too large to load into memory.
I am using the BackgroundWorker object so the UI can be updated with the progress of the job. Below is the code (with parts omitted for brevity) that handles the reading and then outputting of the lines in the file.
public void bgWorker_DoWork(object sender, DoWorkEventArgs e)
{
    // Details of obtaining file paths omitted for brevity
    int totalLineCount = File.ReadLines(inputFilePath).Count();

    using (StreamReader sr = new StreamReader(inputFilePath))
    {
        int currentLine = 0;
        String line;
        while ((line = sr.ReadLine()) != null)
        {
            currentLine++;

            // Match and replace contents of the line
            // omitted for brevity

            if (currentLine % 100 == 0)
            {
                int percentComplete = (currentLine * 100 / totalLineCount);
                bgWorker.ReportProgress(percentComplete);
            }

            using (FileStream fs = new FileStream(outputFilePath, FileMode.Append, FileAccess.Write))
            using (StreamWriter sw = new StreamWriter(fs))
            {
                sw.WriteLine(line);
            }
        }
    }
}
Some of the files I am processing are very large (8 GB with 132 million rows). The process takes a very long time (a 2 GB file took about 9 hours to complete). It looks to be working at around 58 KB/sec. Is this expected or should the process be going faster?
Upvotes: 2
Views: 5111
Reputation: 127543
Don't close and re-open the output file on every loop iteration; open the writer once, outside the loop. This should improve performance, as the writer no longer needs to seek to the end of the file on every single iteration.
Also, File.ReadLines(inputFilePath).Count() causes you to read your input file twice and could be a big chunk of the time. Instead of a percentage based on the line count, calculate the percentage from the stream position.
public void bgWorker_DoWork(object sender, DoWorkEventArgs e)
{
    // Details of obtaining file paths omitted for brevity

    // This StreamWriter constructor appends to the file, so it does the same
    // thing as the FileStream/StreamWriter pair in the original code.
    using (StreamWriter sw = new StreamWriter(outputFilePath, true))
    using (StreamReader sr = new StreamReader(inputFilePath))
    {
        int lastPercentage = 0;
        String line;
        while ((line = sr.ReadLine()) != null)
        {
            // Match and replace contents of the line
            // omitted for brevity

            // Position and Length are longs, not ints, so cast the result back to int.
            int currentPercentage = (int)(sr.BaseStream.Position * 100L / sr.BaseStream.Length);
            if (lastPercentage != currentPercentage)
            {
                bgWorker.ReportProgress(currentPercentage);
                lastPercentage = currentPercentage;
            }

            sw.WriteLine(line);
        }
    }
}
Other than that, you will need to show what the "Match and replace contents of the line (omitted for brevity)" step does, as I would guess that is where your slowness comes from. Run a profiler on your code, see where it spends the most time, and focus your efforts there.
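If you don't have a profiler handy, even a rough Stopwatch split can tell you whether the time is going to the regex work or to the file I/O. Below is a minimal, self-contained sketch of that idea; the paths and the pattern are placeholders I made up, not anything from the question.

using System;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

class TimingSketch
{
    static void Main()
    {
        string inputFilePath = "input.txt";    // placeholder path
        string outputFilePath = "output.txt";  // placeholder path
        var pattern = new Regex("foo", RegexOptions.Compiled); // placeholder pattern

        var regexTimer = new Stopwatch();
        var writeTimer = new Stopwatch();

        using (var sr = new StreamReader(inputFilePath))
        using (var sw = new StreamWriter(outputFilePath))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                // Time the replace step separately from the write step.
                regexTimer.Start();
                line = pattern.Replace(line, "bar");
                regexTimer.Stop();

                writeTimer.Start();
                sw.WriteLine(line);
                writeTimer.Stop();
            }
        }

        Console.WriteLine("Regex time: " + regexTimer.Elapsed + ", write time: " + writeTimer.Elapsed);
    }
}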
Upvotes: 15
Reputation: 3539
Remove the File.ReadLines(...).Count() call at the top, as it reads through the whole file just to get the number of lines.
Upvotes: 0
Reputation: 17010
Open the output writer once, before the loop, and reuse it for every line you write. This should be a LOT faster than instantiating the writer on each line loop, as you have.
I will append this with a code sample shortly. Looks like someone else beat me to the punch on code samples - see @Scott Chamberlain's answer.
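For completeness, here is a minimal sketch of that structure, in the same shape as the code above; the paths and the Regex pattern are placeholders, since the asker's actual pattern was omitted. The writer is opened once and the pattern is compiled once, both outside the loop.

using System.IO;
using System.Text.RegularExpressions;

class ReplaceSketch
{
    static void Main()
    {
        string inputFilePath = "input.txt";    // placeholder path
        string outputFilePath = "output.txt";  // placeholder path

        // Compile the pattern once, outside the loop. The real pattern was
        // omitted from the question, so this one is purely illustrative.
        var pattern = new Regex("foo", RegexOptions.Compiled);

        using (var sr = new StreamReader(inputFilePath))
        using (var sw = new StreamWriter(outputFilePath)) // opened once, reused for every line
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                sw.WriteLine(pattern.Replace(line, "bar"));
            }
        }
    }
}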
Upvotes: 1