Reputation: 83
I have an application that reads a 5 GB text file line by line and converts comma-delimited, double-quoted strings to pipe-delimited format, e.g. "Smith, John","Snow, John" --> Smith, John|Snow, John
I have provided my code below. My question is: Is there a more efficient way of processing large files?
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
    line = strRead.ReadLine
    Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
    Dim replacement As String = "|"
    Dim regEx As New Regex(pattern)
    Dim newLine As String = regEx.Replace(line, replacement)
    newLine = newLine.Replace(Chr(34), "")
    strWrite.WriteLine(newLine)
Loop
strWrite.Close()
UPDATED CODE
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""
Do While strRead.Peek <> -1
    line = strRead.ReadLine
    line = line.Replace(Chr(34) + Chr(44) + Chr(34), "|")
    line = line.Replace(Chr(34), "")
    strWrite.WriteLine(line)
Loop
strWrite.Close()
Upvotes: 2
Views: 409
Reputation: 25023
I tested your code and attempted to make a speed improvement by accumulating output lines into a StringBuilder. I also moved the regex declaration outside the loop.
When that did not work, I examined the CPU usage and disk I/O with Windows Process Monitor and it turned out that the bottleneck is the CPU (even when using an HDD instead of an SSD).
That led me to try an alternative method for modifying the text: if all you need to do is replace "," with | and remove any remaining double-quotes, then
newLine = line.Replace(""",""", "|").Replace("""", "")
turns out to be much faster (roughly fourfold in my testing) than using a regex.
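For reference, here is a minimal sketch of how that replacement might sit in a complete read/write loop. The module name QuoteToPipe and the helpers ConvertLine and ConvertFile are names I made up for illustration. It assumes every field is double-quoted and no field contains an embedded quote; unlike the regex, it will not split on a comma next to an unquoted field.

```vb
Imports System.IO

Module QuoteToPipe
    ' Hypothetical helper: turns "Smith, John","Snow, John" into Smith, John|Snow, John.
    ' Assumes all fields are double-quoted and contain no embedded quotes.
    Function ConvertLine(line As String) As String
        ' First replace the "," separators with a pipe, then drop the remaining quotes.
        Return line.Replace(""",""", "|").Replace("""", "")
    End Function

    ' Hypothetical helper: streams the input file line by line so that
    ' the whole file is never held in memory at once.
    Sub ConvertFile(inputPath As String, outputPath As String)
        ' Using closes both the reader and the writer, even if an exception is thrown.
        Using reader As New StreamReader(inputPath),
              writer As New StreamWriter(outputPath)
            Dim line As String = reader.ReadLine()
            Do While line IsNot Nothing
                writer.WriteLine(ConvertLine(line))
                line = reader.ReadLine()
            Loop
        End Using
    End Sub
End Module
```

It could be called as ConvertFile("C:\LargeFile.csv", "C:\ProcessedFile.txt"). Checking ReadLine for Nothing instead of calling Peek is just a common way to detect the end of the stream; I have not measured whether it makes any difference to speed.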
(Further improvement might be possible with multi-threading, as @Werdna suggested, as long as more than one processor is available and you can coordinate writing back the modified data in the correct order.)
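I did not test a multi-threaded version, but one way it might look is with PLINQ, where AsOrdered takes care of keeping the output lines in the same order as the input. This is only a sketch under the same assumptions as above (all fields quoted, no embedded quotes); ParallelQuoteToPipe, ConvertLine and ConvertFileParallel are hypothetical names. Because the work per line is so small, the overhead of coordinating threads could cancel out any gain, so measure it on your own data.

```vb
Imports System.IO
Imports System.Linq

Module ParallelQuoteToPipe
    ' Hypothetical helper: same string replacement as in the single-threaded case.
    Function ConvertLine(line As String) As String
        Return line.Replace(""",""", "|").Replace("""", "")
    End Function

    ' Hypothetical helper: converts lines on several cores while preserving their order.
    Sub ConvertFileParallel(inputPath As String, outputPath As String)
        ' File.ReadLines reads lazily, so the input is not loaded all at once.
        ' AsOrdered makes PLINQ yield results in the original line order.
        Dim converted = File.ReadLines(inputPath).
            AsParallel().
            AsOrdered().
            Select(Function(line) ConvertLine(line))

        ' WriteAllLines enumerates the query and writes each result in turn.
        File.WriteAllLines(outputPath, converted)
    End Sub
End Module
```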
Upvotes: 1