Reputation: 504
I have large CSV files (0.5-2 GB+) that I am trying to import with PowerShell.
The data looks like this:
Name, Date, Value
"Joe, John", 2016-08-01, "value"
"Smith, Jane", 2016-08-01, "value" ...
I have this code:
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
$reader = New-Object System.IO.StreamReader($csv)
$i = 0
while (($line = $reader.ReadLine()) -ne $null) {
    # Use RegEx to split only on commas outside quotes, then strip the quotes
    $row = ($line -split ',(?=(?:[^"]|"[^"]*")*$)').Replace("`"","")
    # Row indicator
    $i++
    if (($i % 50000) -eq 0) {
        Write-Host "$i rows have been processed in $($elapsed.Elapsed.ToString())."
    }
}
$reader.Close()
Splitting the line on a plain comma works perfectly and I get ~16K rows a second, but I need to split only on commas outside of quotes, so I implemented the regular expression above; with it, performance tanks to about 900 rows a second.
I am looking for a more efficient way to loop through a comma-delimited CSV file whose quoted fields contain commas that must not be treated as delimiters.
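For reference, one variation I am considering is precompiling the regular expression with RegexOptions.Compiled so the pattern is not rebuilt for every line; this is only a sketch, I have not benchmarked it:
$options = [System.Text.RegularExpressions.RegexOptions]::Compiled
$pattern = New-Object System.Text.RegularExpressions.Regex(',(?=(?:[^"]|"[^"]*")*$)', $options)
$reader = New-Object System.IO.StreamReader($csv)
while (($line = $reader.ReadLine()) -ne $null) {
    # Same split as above, but through the precompiled Regex instance
    $row = $pattern.Split($line).Replace("`"","")
}
$reader.Close()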
Upvotes: 1
Views: 3367
Reputation: 4240
Import-Csv, as noted in the comments above, does not load everything into memory unless you ask it to. Like the example in the question, it uses a stream reader and pushes the content it reads to the output pipeline.
You will see significant memory usage if you do something like this:
$var = Import-Csv thefile.csv
After all, the content of the CSV has to go somewhere.
Whereas if you work with the output pipeline there is much less impact, e.g.:
Import-Csv thefile.csv | ForEach-Object {
    Do-Something
}
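For example, the progress reporting from the question can be kept while the records stream down the pipeline; a rough sketch, reusing $csv and the 50000-row interval from the question:
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
$i = 0
Import-Csv $csv | ForEach-Object {
    # $_ is one parsed record with Name, Date and Value properties; quoted commas are already handled
    $i++
    if (($i % 50000) -eq 0) {
        Write-Host "$i rows have been processed in $($elapsed.Elapsed.ToString())."
    }
}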
Finally, if Import-Csv really doesn't work for you, I have a CSV reader class along with a side-by-side implementation of Import-Csv called Indented.Text.Csv on GitHub. This implementation provides a public class with a number of features I needed so I could process CSV files very quickly.
Upvotes: 3