Bnd10706

Reputation: 2363

PowerShell to Break up CSV by Number of Rows

So I am now tasked with handling recurring reports that are more than 1 million lines long.

My last question did not explain everything, so I'm trying to ask a better one.

I'm getting a dozen-plus daily reports that come in as CSV files. I don't know what the headers are, or anything like that, before I get them.

They are huge; I can't open them in Excel.

Basically, I want to break each one up into smaller files of the same report, each maybe 100,000 lines long.

The code I wrote below does not work; I keep getting this error:

Exception of type 'System.OutOfMemoryException' was thrown.

I am guessing I need a better way to do this.

I just need these files broken down to a more manageable size. It does not matter how long it takes, as I can run it overnight.

I found the code below on the internet and tried to adapt it, but I can't get it to work.

$PSScriptRoot
write-host $PSScriptRoot
$loc = $PSScriptRoot
$location = $loc

# how many rows per CSV?
$rowsMax = 10000;

# Get all CSV under current folder
$allCSVs = Get-ChildItem "$location\Split.csv"

# Read and split all of them
$allCSVs | ForEach-Object {
    Write-Host $_.Name;
    $content = Import-Csv "$location\Split.csv"
    $insertLocation = ($_.Name.Length - 4);
    for ($i = 1; $i -le $content.length; $i += $rowsMax) {
        $newName = $_.Name.Insert($insertLocation, "splitted_" + $i)
        $content | select -first $i | select -last $rowsMax | convertto-csv -NoTypeInformation | % { $_ -replace '"', "" } | out-file $location\$newName -fo -en ascii
    }
}

Upvotes: 3

Views: 2638

Answers (2)

Mike Twc

Reputation: 2355

Another option, from the Linux world, is the split command. To get it on Windows, just install Git Bash; then you'll be able to use many Linux tools from CMD/PowerShell. Below is the syntax to achieve your goal:

split -l 100000 --numeric-suffixes --suffix-length 3 --additional-suffix=.csv sourceFile.csv outputfile

It's very fast. If you want, you can wrap split.exe in a small PowerShell function, as sketched below.
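
A minimal sketch of such a wrapper (the function name and parameters are illustrative, not part of the original answer, and it assumes split.exe from Git Bash is on your PATH):

function Split-CsvFile {
    # Hypothetical wrapper around Git Bash's split.exe.
    param(
        [Parameter(Mandatory)] [string] $Path,   # source CSV file
        [int] $LineCount = 100000,               # lines per chunk
        [string] $OutputPrefix = 'outputfile'    # prefix for the chunk files
    )
    # Assumes split.exe (e.g. from C:\Program Files\Git\usr\bin) is on PATH.
    split -l $LineCount --numeric-suffixes --suffix-length 3 --additional-suffix=.csv $Path $OutputPrefix
}

# Example call (hypothetical file name):
Split-CsvFile -Path .\sourceFile.csv -LineCount 100000 -OutputPrefix .\sourceFile_part_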

Upvotes: 0

mklement0

Reputation: 437197

The key is not to read large files into memory in full, which is what you're doing by capturing the output from Import-Csv in a variable ($content = Import-Csv "$location\Split.csv").

That said, while using a single pipeline would solve your memory problem, performance will likely be poor, because you're converting from and back to CSV, which incurs a lot of overhead.
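
For illustration, a memory-friendly but slow Import-Csv-based pipeline along those lines might look like this (a sketch, not part of the original answer; the input file name and the chunk-naming pattern are assumptions):

$chunkLineCount = 10000
$i = 0; $chunkNdx = 0
# Import-Csv streams objects one at a time when piped, so memory stays flat,
# but re-serializing every row with Export-Csv -Append is slow on million-row files.
Import-Csv "$PSScriptRoot\Split.csv" | ForEach-Object {
    if ($i++ % $chunkLineCount -eq 0) { ++$chunkNdx }
    $_ | Export-Csv -NoTypeInformation -Append -LiteralPath ("$PSScriptRoot\Split_{0:000}.csv" -f $chunkNdx)
}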

Even reading and writing the files as text with Get-Content and Set-Content is slow, however.
Therefore, I suggest a .NET-based approach for processing the files as text, which should substantially speed up processing.

The following code demonstrates this technique:

Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {

    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., "...\file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    # Read the file lazily and save every chunk of $chunkLineCount
    # lines to a new file.
    $i = 0; $chunkNdx = 0
    foreach ($line in [IO.File]::ReadLines($csvFile)) {
        if ($i -eq 0) { ++$i; $header = $line; continue } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Close previous file, if any.
            if (++$chunkNdx -gt 1) { $fileWriter.Dispose() }

            # Construct the file path for the next chunk, by
            # instantiating the template with the next sequence number.
            $csvFileChunk = $csvFileChunkTemplate -f $chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"

            # Create the next chunk file and write the header.
            $fileWriter = [IO.File]::CreateText($csvFileChunk)
            $fileWriter.WriteLine($header)
        }
        # Write a data row to the current chunk file.
        $fileWriter.WriteLine($line)
    }
    $fileWriter.Dispose() # Close the last file.

}
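
By default the Write-Verbose messages above won't be displayed; if you save the code as a script and want to see the per-chunk progress, one way is to enable the verbose stream for the session first (the script name below is hypothetical):

$VerbosePreference = 'Continue'   # show verbose output for this session
.\Split-LargeCsv.ps1              # hypothetical script containing the code above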

Note that the above code creates BOM-less UTF-8 files; if your input contains ASCII-range characters only, these files will effectively be ASCII files.
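
If you need a different output encoding, one option (a sketch, assuming PowerShell 5+ for the ::new syntax) is to replace the [IO.File]::CreateText call with a StreamWriter that takes an explicit encoding, e.g. ASCII, which is what the original script attempted:

# Replace the CreateText line in the loop above with an explicit encoding:
$fileWriter = [IO.StreamWriter]::new($csvFileChunk, $false, [Text.Encoding]::ASCII)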


Here's the equivalent single-pipeline solution, which is likely to be substantially slower:

Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {

    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., ".../file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    $i = 0; $chunkNdx = 0
    Get-Content -LiteralPath $csvFile | ForEach-Object {
        if ($i -eq 0) { ++$i; $header = $_; return } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Construct the file path for the next chunk.
            $csvFileChunk = $csvFileChunkTemplate -f ++$chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            Set-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $header
        }
        # Write data row to the current chunk file.
        Add-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $_
    }

}

Upvotes: 3
