Rob

Reputation: 1042

Split CSV with PowerShell

I have large CSV files (50-500 MB each). Running complicated PowerShell commands on these takes forever and/or hits memory issues.

Processing the data requires grouping by common fields, say in ColumnA. Assuming that the data is already sorted by that column, if I split these files blindly (i.e. every x-thousand lines), matching entries could still end up in different parts. There are thousands of different groups in A, so splitting every one into its own file would create too many files.

How can I split it into files of roughly 10,000 lines without losing the groups? E.g. rows 1-13 would be A1 in ColumnA, rows 14-17 would be A2, etc., and rows 9997-10012 would be A784. In this case I would want the first file to contain rows 1-10012 and the next one to start with row 10013.

Obviously I would want to keep the entire rows (rather than just ColumnA), so that if I pasted all the resulting files back together, the result would be the same as the original file.

Upvotes: 3

Views: 21638

Answers (4)

iRon

Reputation: 23623

To complement the helpful answer from mjolinor, here is a reusable function with a few additional parameters, using the steppable pipeline, which makes it about a factor of 8 faster:

function Split-Content {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory=$true)][String]$Path,
        [ULong]$HeadSize,
        [ValidateRange(1, [ULong]::MaxValue)][ULong]$DataSize = [ULong]::MaxValue,
        [Parameter(Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]$Value
    )
    begin {
        $Header = [Collections.Generic.List[String]]::new()
        $DataCount = 0
        $PartNr = 1
    }
    process {
        $ReadCount = 0
        while ($ReadCount -lt @($_).Count -and $Header.Count -lt $HeadSize) {
            if (@($_)[$ReadCount]) { $Header.Add(@($_)[$ReadCount]) }
            $ReadCount++
        }
        if ($ReadCount -lt @($_).Count -and $Header.Count -ge $HeadSize) {
            do {
                if ($DataCount -le 0) { # Should never be less
                    $FileInfo = [System.IO.FileInfo]$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
                    $FileName = $FileInfo.BaseName + $PartNr++ + $FileInfo.Extension
                    $LiteralPath = [System.IO.Path]::Combine($FileInfo.DirectoryName, $FileName)
                    $steppablePipeline = { Set-Content -LiteralPath $LiteralPath }.GetSteppablePipeline()
                    $steppablePipeline.Begin($PSCmdlet)
                    $steppablePipeline.Process($Header)
                }
                $Next = [math]::min(($DataSize - $DataCount), @($_).Count)
                if ($Next -gt $ReadCount) { $steppablePipeline.Process(@($_)[$ReadCount..($Next - 1)]) }
                $DataCount = ($DataCount + $Next - $ReadCount) % $DataSize
                if ($DataCount -le 0) { $steppablePipeline.End() }
                $ReadCount = $Next % @($_).Count
            } while ($ReadCount)
        } 
    }
    end {
        if ($steppablePipeline) { $steppablePipeline.End() }
    }
}

Parameters

Value
Specifies the content lines to be broken into parts. Multiple lines sent through the pipeline at a time (i.e. as sub-arrays like Object[]) are also passed to the output file at a time (assuming they fit within -DataSize).

Path
Specifies the path for the output files. Each filename is suffixed with a part number (starting at 1).

HeadSize
Specifies the number of header lines that will be taken from the input and prepended to each file part. The default is 0, meaning no header lines are copied.

DataSize
Specifies the number of lines (after the header) that will be successively taken from the input as data and written to each file part. The default is [ULong]::MaxValue, basically meaning that all data is copied to a single file.

Example 1:

Get-Content -ReadCount 1000 .\Test.Csv |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000

This will split the .\Test.Csv file into a series of CSV files with 10,000 data rows each (plus the header line).

Note that the performance of this Split-Content function depends heavily on the -ReadCount of the preceding Get-Content cmdlet.
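
To gauge a sensible value, you can time a couple of -ReadCount settings with Measure-Command (a minimal, hypothetical benchmark that reuses the file names from Example 1):

# Larger -ReadCount batches mean fewer (expensive) pipeline invocations,
# at the cost of holding more lines in memory per batch
Measure-Command { Get-Content .\Test.Csv -ReadCount 100  |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000 }
Measure-Command { Get-Content .\Test.Csv -ReadCount 1000 |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000 }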

Example 2:

Get-Process |Out-String -Stream |Split-Content -Path .\Process.Txt -HeadSize 2 -DataSize 20

This will write chunks of 20 processes to the .\Process<PartNr>.Txt files, each preceded by the standard two-line header format:

 NPM(K)    PM(M)      WS(M)     CPU(s)      Id  SI ProcessName
 ------    -----      -----     ------      --  -- -----------
 ... # 20 rows following

Upvotes: 0

Frode F.

Reputation: 54821

This is my attempt; it got messy :-P It loads the whole file into memory while splitting it, but it's pure text, so it should take less memory than imported objects, though still roughly the size of the file.

$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1

$start = 1

while ($start -lt $lines - 1) {
    #Set minimum $end value (last line)
    if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }

    #Value to compare. ColA is the first column in my file = [0]. ColB is the second column = [1]
    $avalue = $content[$end].split(",")[0]
    #If not the last line in the file
    if ($end -ne $lines - 1) {
        #Increase $end by 1 while ColA is the same (the bounds check stops at the end of the file)
        while ($end -lt $lines -and $content[$end].split(",")[0] -eq $avalue) { $end++ }
        #Return to the last line with an equal ColA value
        $end--
    }
    #Create new csv-part
    $filename = $file.FullName.Replace($file.BaseName, ($file.BaseName + ".part$filepart"))
    @($csvheader, $content[$start..$end]) | Set-Content $filename

    #Fix counters
    $filepart++
    $start = $end + 1
}

file.csv:

ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10

Results (I used $minlines = 5):

file.part1.csv:

ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20

file.part2.csv:

ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60

file.part3.csv:

ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10
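
As a quick sanity check (a hypothetical snippet, assuming the part files sit next to file.csv), you can reassemble the parts and diff them against the original:

#Strip the duplicated header from each part, then compare the reassembled
#lines against the original file; no output from Compare-Object means a match.
#Lexical sorting of the names is fine while the part numbers stay single-digit.
$original = Get-Content .\file.csv
$parts = Get-ChildItem .\file.part*.csv | Sort-Object Name
$reassembled = @($original[0]) + ($parts | ForEach-Object { Get-Content $_.FullName | Select-Object -Skip 1 })
Compare-Object $original $reassembled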

Upvotes: 2

mjolinor

Reputation: 68243

Not tested. This assumes ColumnA is the first column and the data is plain comma-delimited. You'll need to adjust the line that creates the regex to suit your data.

$count = 0

$header = get-content file.csv -TotalCount 1

get-content file.csv -ReadCount 1000 |
 foreach {
  #add tail entries from last batch to beginning of this batch
  $newbatch = $tail + $_

  #create regex to match the last entry in this batch
  #the trailing comma keeps a value like 'A1' from also matching 'A10'
  $regex = '^' + [regex]::Escape(($newbatch[-1].split(',')[0])) + ','

  #Extract everything that doesn't match the last entry to a new file

  #Add header if this is not the first file
  if ($count)
    {
      $header |
        set-content "c:\somedir\filepart_$count"
    }

  $newbatch -notmatch $regex |
   add-content "c:\somedir\filepart_$count"

  #Extract tail entries to add to next batch
  $tail = @($newbatch -match $regex)

  #Increment file counter
  $count++

 }

#Flush the final tail group, which the loop above never writes
if ($tail)
  {
    $header | set-content "c:\somedir\filepart_$count"
    $tail | add-content "c:\somedir\filepart_$count"
  }
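
Once it has run, a quick look at the parts (hypothetical, assuming the output directory above) shows that each file stays near the -ReadCount batch size:

#List the generated parts with their line counts
get-childitem c:\somedir\filepart_* |
 foreach { '{0}: {1} lines' -f $_.Name, @(get-content $_.FullName).Count }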

Upvotes: 5

alroc

Reputation: 28144

This requires PowerShell v3 (due to -append on Export-CSV).

Also, I'm assuming that you have column headers and the first column is named col1. Adjust as necessary.

import-csv MYFILE.csv | foreach-object { $_ | export-csv -notypeinfo -noclobber -append ($_.col1 + ".csv") }

This will create one file for each distinct value in the first column, with that value as the file name.
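
Since the question mentions thousands of distinct groups, it may be worth counting them first to see how many files this one-liner would create (a hypothetical pre-check using the same col1 assumption):

#Count the distinct col1 values, i.e. the number of files that will be created
(import-csv MYFILE.csv | group-object col1).Count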

Upvotes: 0
