Reputation: 246
I am merging a lot of large CSV files, e.g. while skipping the leading junk and appending the filename to each line:
Get-ChildItem . | Where-Object Name -match 'Q[1-4]20[0-1][0-9]\.csv' |
    ForEach-Object {
        $file = $_.BaseName
        Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
            "$_,${file}" | Out-File -Append temp.csv -Encoding ASCII
        }
    }
In PowerShell this is incredibly slow, even on an i7/16 GB machine (~5 MB/minute). Can I make it more efficient, or should I just switch to, e.g., Python?
Upvotes: 4
Views: 4649
Reputation: 46710
Get-Content / Set-Content are terrible with larger files. Streams are a good alternative when performance is key. So with that in mind, let's use one stream to read in each file and another to write out the results.
$rootPath = "C:\temp"
$outputPath = "C:\test\somewherenotintemp.csv"

$streamWriter = [System.IO.StreamWriter]$outputPath
Get-ChildItem $rootPath -Filter "*.csv" -File | ForEach-Object {
    $file = $_.BaseName
    [System.IO.File]::ReadAllLines($_.FullName) |
        Select-Object -Skip 3 | ForEach-Object {
            $streamWriter.WriteLine(('{0},"{1}"' -f $_, $file))
        }
}
$streamWriter.Close(); $streamWriter.Dispose()
Create a writing stream, $streamWriter, to output the edited lines from each file. We could read and write the files in larger batches, which would be faster, but since we need to skip a few lines and change each remaining one, processing line by line is simpler. Avoid writing anything to the console during this time, as it will only slow everything down.
What '{0},"{1}"' -f $_, $file does is quote the last "column" that is added, in case the base name contains spaces.
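A quick sketch of what the format operator produces, using a hypothetical base name that contains a space:

    $line = '1,2,3'
    $file = 'Q1 2017'   # hypothetical base name with a space
    '{0},"{1}"' -f $line, $file
    # 1,2,3,"Q1 2017"

The quotes keep a CSV parser from splitting or mangling that final field.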
Upvotes: 4
Reputation: 1009
To keep -Append from ruining the performance of your script, you can use a buffer array variable:
# Initialize buffer
$csvBuffer = @()

Get-ChildItem *.csv | ForEach-Object {
    $file = $_.BaseName
    $content = Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
        "$_,${file}"
    }

    # Populate buffer
    $csvBuffer += $content

    # Write buffer to disk once it contains 5000 lines or more
    if ($csvBuffer.Count -ge 5000) {
        $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
        $csvBuffer = @()
    }
}

# Important: flush the buffer remainder
if ($csvBuffer.Count -gt 0) {
    $csvBuffer | Out-File -FilePath temp.csv -Encoding ASCII -Append
    $csvBuffer = @()
}
Upvotes: 1
Reputation: 626
Measure-Command -Expression {
    Get-ChildItem C:\temp | Where-Object Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
            "$_,$($file)" | Out-File -Append C:\temp\t\tempe1.csv -Encoding ASCII -Force
        }
    }
} # TotalSeconds : 12,0526802 for 11415 lines
If you first put everything into an array in memory, things go a lot faster:
Measure-Command -Expression {
    $arr = @()
    Get-ChildItem C:\temp | Where-Object Name -like "*.csv" | ForEach-Object {
        $file = $_.BaseName
        $arr += Get-Content $_.FullName | Select-Object -Skip 3 | ForEach-Object {
            "$_,$($file)"
        }
    }
    $arr | Out-File -Append C:\temp\t\tempe2.csv -Encoding ASCII -Force
} # TotalSeconds : 0,8197193 for 11415 lines
EDIT: Fixed it so that your filename was added to each row.
Upvotes: 2