Nick Kester

Reputation: 83

VB read/write 5 GB text files

I have an application that reads a 5 GB text file line by line and converts comma-delimited, double-quoted strings to pipe-delimited format, e.g. "Smith, John","Snow, John" --> Smith, John|Snow, John

I have provided my code below. My question is: Is there a more efficient way of processing large files?

' Requires "Imports System.Text.RegularExpressions" at the top of the file.
Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""

Do While strRead.Peek <> -1
    line = strRead.ReadLine
    Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
    Dim replacement As String = "|"
    Dim regEx As New Regex(pattern)

    Dim newLine As String = regEx.Replace(line, replacement)
    newLine = newLine.Replace(Chr(34), "")
    strWrite.WriteLine(newLine)

Loop
strRead.Close()
strWrite.Close()

UPDATED CODE

Dim fName As String = "C:\LargeFile.csv"
Dim wrtFile As String = "C:\ProcessedFile.txt"
Dim strRead As New System.IO.StreamReader(fName)
Dim strWrite As New System.IO.StreamWriter(wrtFile)
Dim line As String = ""

Do While strRead.Peek <> -1
   line = strRead.ReadLine
   line = line.Replace(Chr(34) + Chr(44) + Chr(34), "|")
   line = line.Replace(Chr(34), "")

   strWrite.WriteLine(line)

Loop

strRead.Close()
strWrite.Close()

Upvotes: 2

Views: 409

Answers (1)

Andrew Morton

Reputation: 25023

I tested your code and attempted to make a speed improvement by accumulating output lines into a StringBuilder. I also moved the regex declaration outside the loop.

When that did not help, I examined the CPU usage and disk I/O with Windows Process Monitor, and it turned out that the bottleneck was the CPU (even when using an HDD instead of an SSD).

That led me to try an alternative method for modifying the text: if all you need to do is replace "," with | and remove any remaining double-quotes, then

newLine = line.Replace(""",""", "|").Replace("""", "")

turns out to be much faster (roughly fourfold in my testing) than using a regex.
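
For context, here is a minimal sketch of how that replacement might sit in a complete streaming loop. The module and method names (QuotedCsvToPipe, ConvertLine, ConvertFile) are made up for illustration. The sketch assumes every field is double-quoted and that no field contains an embedded double quote. The Using block closes both files even if an exception occurs.

```vbnet
Imports System.IO

Module QuotedCsvToPipe

    ' Hypothetical helper: turns "Smith, John","Snow, John"
    ' into Smith, John|Snow, John.
    ' Assumes every field is quoted and no field contains a double quote.
    Function ConvertLine(line As String) As String
        Return line.Replace(""",""", "|").Replace("""", "")
    End Function

    ' Streams the input to the output one line at a time, so memory use
    ' stays small regardless of file size.
    Sub ConvertFile(inputPath As String, outputPath As String)
        Using reader As New StreamReader(inputPath),
              writer As New StreamWriter(outputPath)
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                writer.WriteLine(ConvertLine(line))
                line = reader.ReadLine()
            End While
        End Using
    End Sub

End Module
```

With the file names from the question, the call would be ConvertFile("C:\LargeFile.csv", "C:\ProcessedFile.txt").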

(Further improvement might be possible with multi-threading, as @Werdna suggested, as long as more than one processor is available and you can coordinate writing back the modified data in the correct order.)
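
One possible way to try the multi-threaded idea without hand-written coordination is PLINQ, where AsOrdered asks the runtime to keep the results in input order. This is an untested-for-speed sketch, not something I measured. The names ParallelQuotedCsvToPipe and ConvertFileParallel are made up for illustration, and whether it beats the single-threaded loop will depend on the machine and the data.

```vbnet
Imports System.IO
Imports System.Linq

Module ParallelQuotedCsvToPipe

    ' Same per-line conversion as in the answer above.
    ' Assumes every field is quoted and no field contains a double quote.
    Function ConvertLine(line As String) As String
        Return line.Replace(""",""", "|").Replace("""", "")
    End Function

    ' Hypothetical parallel variant: File.ReadLines reads lazily,
    ' AsParallel spreads the conversion over the available cores,
    ' and AsOrdered preserves the original line order in the output.
    Sub ConvertFileParallel(inputPath As String, outputPath As String)
        Dim converted = File.ReadLines(inputPath).
            AsParallel().
            AsOrdered().
            Select(Function(line) ConvertLine(line))
        File.WriteAllLines(outputPath, converted)
    End Sub

End Module
```

Because the per-line work is small, the overhead of partitioning and reordering may cancel out part of the gain, so it is worth timing both versions on a sample of the real file.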

Upvotes: 1
