DX101

Reputation: 158

Slow PowerShell script for CSV modification

I'm using a PowerShell script to append data to the end of a bunch of files. Each file is a CSV of around 50 MB (say 2 million-ish lines), and there are about 50 files.

The script I'm using looks like this:

$MyInvocation.MyCommand.Path

$files = ls *.csv

foreach ($f in $files)
{
    $basename = [System.IO.Path]::GetFileNameWithoutExtension($f)
    $year = $basename.Substring(0, 4)

    Write-Host "Starting" $basename

    $r = [IO.File]::OpenText($f)
    while ($r.Peek() -ge 0) {
        $line = $r.ReadLine()
        $line + "," + $year | Add-Content $(".\DR_" + $basename + ".CSV")
    }
    $r.Dispose()
}

Problem is, it's pretty slow. It's taken about 12 hours to get through them. It's not super complex, so I wouldn't expect it to take that long to run. What could I do to speed it up?

Upvotes: 1

Views: 2255

Answers (2)

alroc

Reputation: 28174

Don't drop down to .NET Framework static methods and string concatenation when there are cmdlets that can do the work on objects. Collect your data, add the year column, then export to your new file. You're also doing a ton of file I/O, and that will slow you down as well.

This will probably require a bit more memory, but it reads the whole file at once and writes the whole file at once. It also assumes that your CSV files have column headings. And it's much easier for someone else to look at and understand exactly what's going on (write your scripts so they can be read!).

# Always use full cmdlet names in scripts, not aliases
$files = Get-ChildItem *.csv;

foreach ($f in $files)
{
    # BaseName is a property of the file object in PowerShell; there's no need to call a static method
    $basename = $f.BaseName;
    $year = $f.BaseName.Substring(0, 4);

    # Every time you use Write-Host, a puppy dies
    "Starting $basename";

    # If you've got CSV data, treat it as CSV data. PowerShell can import it into a collection natively.
    $data = Import-Csv $f;
    foreach ($row in $data) {
        # Add a Year "property" to each row object
        $row | Add-Member -MemberType NoteProperty -Name "Year" -Value $year;
    }
    # Export the whole modified collection to the output file in one pass
    $data | Export-Csv -NoTypeInformation -Path (".\DR_" + $basename + ".CSV") -NoClobber;
}
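
The same thing can be written as a single pipeline using a calculated property on Select-Object; a minimal sketch, under the same $f, $year, and $basename assumptions as above:

# Add the Year column via a calculated property instead of Add-Member
Import-Csv $f |
    Select-Object *, @{ Name = 'Year'; Expression = { $year } } |
    Export-Csv -NoTypeInformation -Path (".\DR_" + $basename + ".CSV")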

Upvotes: 0

vonPryz

Reputation: 24071

Reading and writing a file row by row can be a bit slow. Your antivirus may be contributing to the slowness as well. Use Measure-Command to see which parts of the script are the slow ones.
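
For example, to see how much of the time goes to reading alone, you can time just the read loop; a rough sketch, reusing $f from the question's loop:

# Time the read loop by itself, discarding the lines, to isolate the read cost
Measure-Command {
    $r = [IO.File]::OpenText($f.FullName)
    while ($r.Peek() -ge 0) { $null = $r.ReadLine() }
    $r.Dispose()
} | Select-Object TotalSeconds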

As general advice, write a few large blocks rather than lots of small ones. You can achieve this by buffering content in a StringBuilder and appending its contents to the output file every, say, 1000 processed rows. Like so:

$sb = New-Object Text.StringBuilder # Buffer for pending output
$i = 0 # Row counter
while ($r.Peek() -ge 0) {
    # Append the formatted row to the buffer instead of the file
    [void]$sb.Append(("{0},{1}{2}" -f $r.ReadLine(), $year, [Environment]::NewLine))

    if (++$i % 1000 -eq 0) { # Every 1000 rows, dump the buffer into the file
        Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()
        $sb = New-Object Text.StringBuilder # Reset the StringBuilder
    }
}
# Don't miss the tail of the contents
Add-Content $(".\DR_" + $basename + ".CSV") $sb.ToString()
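
An alternative with the same effect is a buffered StreamWriter: it batches writes internally, so the output file is opened once instead of once per Add-Content call. A minimal sketch, assuming the same $f, $year, and $basename as in the question:

# Open the reader and writer once; StreamWriter buffers writes internally
$r = [IO.File]::OpenText($f.FullName)
$w = New-Object IO.StreamWriter (Join-Path (Get-Location) ("DR_" + $basename + ".CSV"))
while ($r.Peek() -ge 0) {
    $w.WriteLine(("{0},{1}" -f $r.ReadLine(), $year))
}
$r.Dispose()
$w.Dispose()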

Upvotes: 3
