P.Windridge

Reputation: 246

PowerShell Get-Content with basic manipulations is very slow

I am merging a lot of large CSV files, skipping the leading junk in each and appending the filename to every line:

Get-ChildItem . | Where Name -match "Q[1-4]20[0-1][0-9]\.csv$" |
ForEach-Object {
    $file = $_.BaseName
    Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
        "$_,${file}" | Out-File -Append temp.csv -Encoding ASCII
    }
}

This is incredibly slow in PowerShell, even on an i7/16 GB machine (~5 MB/minute). Can I make it more efficient, or should I just switch to something like Python?

Upvotes: 4

Views: 4649

Answers (3)

Matt

Reputation: 46710

Get-Content / Set-Content are terrible with larger files. Streams are a good alternative when performance is key. So, with that in mind, let's use [System.IO.File]::ReadAllLines to read each file and a StreamWriter to write out the results.

$rootPath = "C:\temp"
$outputPath = "C:\test\somewherenotintemp.csv"
$streamWriter = [System.IO.StreamWriter]$outputPath
Get-ChildItem $rootPath -Filter "*.csv" -File | ForEach-Object {
    $file = $_.BaseName
    # ReadAllLines pulls the whole file in one call; much faster than Get-Content
    [System.IO.File]::ReadAllLines($_.FullName) |
        Select-Object -Skip 3 | ForEach-Object {
            $streamWriter.WriteLine(('{0},"{1}"' -f $_, $file))
        }
}
$streamWriter.Close(); $streamWriter.Dispose()

Create a writing stream, $streamWriter, to output the edited lines of each file. We could read and write the files in larger batches, which would be faster, but we need to ignore a few lines and make changes to each one, so processing line by line is simpler. Avoid writing anything to the console during this time, as it will just slow everything down.
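Going the other way, if the files are too big to hold in memory at once, ReadAllLines can be swapped for a StreamReader without changing the line-by-line shape. A minimal sketch, assuming the same $rootPath and $streamWriter as above:

Get-ChildItem $rootPath -Filter "*.csv" -File | ForEach-Object {
    $file = $_.BaseName
    $reader = [System.IO.StreamReader]$_.FullName
    try {
        # Skip the three junk lines at the top of each file
        1..3 | ForEach-Object { $null = $reader.ReadLine() }
        # Stream the rest, one line at a time
        while ($null -ne ($line = $reader.ReadLine())) {
            $streamWriter.WriteLine(('{0},"{1}"' -f $line, $file))
        }
    }
    finally {
        $reader.Dispose()
    }
}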

The format string '{0},"{1}"' -f $_, $file quotes the appended "column" in case the basename contains spaces or commas.
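A quick illustration with made-up values (the basename 'Q1 2017' is hypothetical):

$line = 'a,b,c'              # a data line from one of the CSVs
$file = 'Q1 2017'            # a hypothetical basename containing a space
'{0},"{1}"' -f $line, $file  # outputs: a,b,c,"Q1 2017"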

Upvotes: 4

Chris

Reputation: 1009

To keep -Append from ruining the performance of your script, you could use a buffer array variable:

# Initialize buffer
$csvBuffer = @()

Get-ChildItem *.csv | ForEach-Object {
    $file = $_.BaseName
    $content = Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
        "$_,${file}"
    }

    # Populate buffer
    $csvBuffer += $content

    # Write buffer to disk once it contains 5000 lines or more
    if( $csvBuffer.Count -ge 5000 )
    {
        $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
        $csvBuffer = @()
    }
}

# Important: flush the buffer remainder
if( $csvBuffer.Count -gt 0 )
{
    $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
}
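Note that $csvBuffer += $content copies the whole array on every addition, which gets slow as the buffer grows. A List[string] keeps the same buffering idea without the copying. A sketch, keeping the 5000-line threshold from above (the ::new() syntax needs PowerShell 5+):

$csvBuffer = [System.Collections.Generic.List[string]]::new()

Get-ChildItem *.csv | ForEach-Object {
    $file = $_.BaseName
    Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
        # Add appends in place; no array copy
        $csvBuffer.Add("$_,${file}")
    }

    # Flush once the buffer reaches 5000 lines
    if ($csvBuffer.Count -ge 5000) {
        $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
        $csvBuffer.Clear()
    }
}

# Flush the remainder
if ($csvBuffer.Count -gt 0) {
    $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
}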

Upvotes: 1

Snak3d0c

Reputation: 626

Measure-Command -Expression {
    Get-ChildItem C:\temp | Where Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
            "$_,$($file)" | Out-File -Append C:\temp\t\tempe1.csv -Encoding ASCII -Force
        }
    }
} # TotalSeconds      : 12,0526802 for 11415 lines

If you first put everything into an array in memory, things go a lot faster:

Measure-Command -Expression {
    $arr = @()
    Get-ChildItem C:\temp | Where Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        $arr += Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
            "$_,$($file)" 
        }
    }
    $arr | Out-File -Append C:\temp\t\tempe2.csv -Encoding ASCII -Force
} # TotalSeconds      :  0,8197193  for 11415 lines

EDIT: Fixed it so that your filename was added to each row.
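For what it's worth, the intermediate array (and its += copying) can be skipped entirely by letting the pipeline stream straight into a single Out-File call. A sketch, not timed here, with a made-up output name tempe3.csv to mirror the naming above:

Get-ChildItem C:\temp | Where Name -like "*.csv" | ForEach-Object {
    $file = $_.BaseName
    Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
        "$_,$($file)"
    }
} | Out-File C:\temp\t\tempe3.csv -Encoding ASCII -Force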

Upvotes: 2
