Tchotchke

Reputation: 399

Changing the Delimiter in a large CSV file using Powershell

I need a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 MB to several GB), using Import-CSV and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code:

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()
    $details = [ordered]@{
                            "Plugin ID" = $line[0]
                            CVE = $line[1]
                            CVSS = $line[2]
                            Risk = $line[3]     
                         }                        
    $export = New-Object PSObject -Property $details
    $export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"    
}

This little loop took nearly 2 minutes to process a 20 MB file. Scaling up at this speed would mean over an hour for the smallest CSV file I'm currently working with.
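(Most of that time is likely spent in Export-Csv -Append, which reopens the output file for every single row. For reference, a sketch of the same loop that streams every row through a single Export-Csv call instead, reusing the same $reader setup and columns as above:)

# One pipeline: rows are emitted as objects and written by a single Export-Csv
& {
    While(!$reader.EndOfData)
    {
        $line = $reader.ReadFields()
        [PSCustomObject]@{
            "Plugin ID" = $line[0]
            CVE         = $line[1]
            CVSS        = $line[2]
            Risk        = $line[3]
        }
    }
} | Export-Csv -Delimiter "|" -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"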

I've tried this as well:

$export = New-Object System.Collections.ArrayList   # collect the rows, export once at the end

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()  

    $details = [ordered]@{
                             # Same data as before
                         }

    $export.Add($details) | Out-Null        
}

$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"

This is MUCH FASTER but doesn't provide the right information in the new CSV. Instead I get rows and rows of this:

"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"

So, two questions:

1) Can the first block of code be made faster?
2) How can I unwrap the ArrayList in the second example to get to the actual data?
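For context on question 2: Export-Csv writes out the properties of whatever object it receives, and an ordered dictionary's own properties are Count, IsReadOnly, Keys, Values and so on, which is exactly what the output above shows. One way to unwrap the ArrayList (a sketch reusing the New-Object PSObject -Property approach from the first block) would be:

$export |
    ForEach-Object { New-Object PSObject -Property $_ } |   # turn each ordered dictionary into an object with real columns
    Export-Csv -Delimiter "|" -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"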

EDIT: Sample data found here - http://pastebin.com/6L98jGNg

Upvotes: 2

Views: 2516

Answers (2)

woxxom

Reputation: 73586

This is simple text processing, so the bottleneck should be disk read speed: 1 second per 100 MB, or 10 seconds per 1 GB, for the OP's sample (repeated to the mentioned size) as measured here on an i7. The results would be worse for files with many/all small quoted fields.

The algorithm is simple:

  1. Read the file in big string chunks, e.g. 1 MB.
    This is much faster than reading millions of CR/LF-separated lines because:
    • fewer checks are performed, since we mostly look only for doublequotes;
    • fewer iterations of our code are executed by the interpreter, which is slow.
  2. Find the next doublequote.
  3. Depending on the current $inQuotedField flag, decide whether the found doublequote starts a quoted field (it should be preceded by , plus, optionally, some spaces) or ends the current quoted field (it should be followed by any even number of doublequotes, optionally some spaces, then ,).
  4. Replace the delimiters in the preceding span, or to the end of the 1 MB chunk if no quotes were found.

The code makes some reasonable assumptions, but it may fail to detect an escaped field if its doublequote is followed or preceded by more than 3 spaces before/after the field delimiter. Those checks wouldn't be too hard to add, and I might've missed some other edge case, but I'm not that interested.

$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM

$delim = [char]','
$newDelim = [char]'|'

$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
    $sourcePath,
    [IO.FileMode]::Open,
    [IO.FileAccess]::Read,
    [IO.FileShare]::Read,
    $buf.length,  # let OS prefetch the next chunk in background
    [IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)

$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]@($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)

while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
    $s = [string]::new($buf, 0, $nRead+$bufStart)
    $len = $s.length
    $pos = 0
    $out.Clear() >$null

    do {
        $iQuote = $s.IndexOf([char]'"', $pos)
        if ($inQuotedField) {
            $iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
            if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
                # no closing quote in buffer safezone
                $out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
                break
            }
            if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
                # even number of quotes are just quoted quotes
                $inQuotedField = $matches[1].length % 2 -eq 0
            }
            $out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
            $pos = $iDelim + 1
            continue
        }
        if ($iQuote -ge 0) {
            $iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
            if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
                $inQuotedField = $true
            }
            $replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
        } elseif ($pos -gt 0) {
            $replaced = $s.Substring($pos).Replace($delim, $newDelim)
        } else {
            $replaced = $s.Replace($delim, $newDelim)
        }
        $out.Append($replaced) >$null
        $pos = $iQuote + 1
    } while ($iQuote -ge 0)

    $target.Write($out)

    $bufStart = 0
    for ($i = $out.length; $i -lt $s.length; $i++) {
        $buf[$bufStart++] = $buf[$i]
    }
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()

Upvotes: 4

TheMadTechnician

Reputation: 36297

Still not what I would call fast, but using the -join operator makes this considerably faster than what you have listed:

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData){
    $line = $reader.ReadFields()
    $line -join '|' | Add-Content C:\Temp\TestOutput.csv
}

That took a hair under 32 seconds to process a 20MB file. At that rate your 750MB file would be done in under 20 minutes, and bigger files should go at about 26 minutes per gig.
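Most of the remaining time is likely spent in Add-Content, which opens and closes the output file for every line. A variant (not part of the answer above, just a sketch) that keeps a single StreamWriter open for the whole run should cut that overhead:

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")
# Keep one writer open instead of re-opening the file per line with Add-Content
$writer = New-Object System.IO.StreamWriter "C:\Temp\TestOutput.csv"

While(!$reader.EndOfData){
    $writer.WriteLine(($reader.ReadFields() -join '|'))
}

$writer.Close()
$reader.Close()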

Upvotes: 0
