Reputation: 1042
I have large CSV files (50-500 MB each). Running complicated PowerShell commands on these takes forever and/or hits memory issues.
Processing the data requires grouping by common fields, say in ColumnA. Assuming the data is already sorted by that column, if I split these files naively (e.g. every x thousand lines), matching entries could still end up in different parts. There are thousands of different groups in ColumnA, so splitting each one into its own file would create too many files.
How can I split it into files of roughly 10,000 lines without breaking up the groups? E.g. rows 1-13 would be A1 in ColumnA, rows 14-17 would be A2, etc., and rows 9997-10012 would be A784. In this case I would want the first file to contain rows 1-10012 and the next one to start with row 10013.
Obviously I would want to keep the entire rows (rather than just Column A), so if I pasted all the resulting files together this would be the same as the original file.
Upvotes: 3
Views: 21638
Reputation: 23623
To complement the helpful answer from mjolinor: a reusable function with a few additional parameters, using the steppable pipeline, which makes it about a factor of 8 faster:
function Split-Content {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory=$true)][String]$Path,
        [ULong]$HeadSize,
        [ValidateRange(1, [ULong]::MaxValue)][ULong]$DataSize = [ULong]::MaxValue,
        [Parameter(Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]$Value
    )
    begin {
        $Header = [Collections.Generic.List[String]]::new()
        $DataCount = 0
        $PartNr = 1
    }
    process {
        $ReadCount = 0
        # Collect header lines until -HeadSize is reached
        while ($ReadCount -lt @($_).Count -and $Header.Count -lt $HeadSize) {
            if (@($_)[$ReadCount]) { $Header.Add(@($_)[$ReadCount]) }
            $ReadCount++
        }
        if ($ReadCount -lt @($_).Count -and $Header.Count -ge $HeadSize) {
            do {
                if ($DataCount -le 0) { # Should never be less
                    # Start a new file part: resolve the path, append the part number,
                    # and open a steppable Set-Content pipeline for it
                    $FileInfo = [System.IO.FileInfo]$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
                    $FileName = $FileInfo.BaseName + $PartNr++ + $FileInfo.Extension
                    $LiteralPath = [System.IO.Path]::Combine($FileInfo.DirectoryName, $FileName)
                    $steppablePipeline = { Set-Content -LiteralPath $LiteralPath }.GetSteppablePipeline()
                    $steppablePipeline.Begin($PSCmdlet)
                    $steppablePipeline.Process($Header)
                }
                $Next = [math]::min(($DataSize - $DataCount), @($_).Count)
                if ($Next -gt $ReadCount) { $steppablePipeline.Process(@($_)[$ReadCount..($Next - 1)]) }
                $DataCount = ($DataCount + $Next - $ReadCount) % $DataSize
                if ($DataCount -le 0) { $steppablePipeline.End() }
                $ReadCount = $Next % @($_).Count
            } while ($ReadCount)
        }
    }
    end {
        if ($steppablePipeline) { $steppablePipeline.End() }
    }
}
Value
Specifies the content lines to be broken into parts. Multiple lines sent through the pipeline at a time (i.e. sub-arrays such as Object[]) will also be written to the output file at a time (assuming the batch fits the -DataSize).
Path
Specifies a path to one or more locations. Each filename in the location is suffixed with a part number (starting with 1).
HeadSize
Specifies the number of lines of the header that will be taken from the input and prepended to each file part. The default is 0, meaning no header lines are copied.
DataSize
Specifies the number of lines that will be successively taken (after the header) from the input as data and written to each file part. The default is [ULong]::MaxValue, which basically means that all data is copied to a single file.
Get-Content -ReadCount 1000 .\Test.Csv |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000
This will split the .\Test.Csv file into a series of CSV files of 10000 rows each (plus a header).
Note that the performance of this Split-Content function depends heavily on the -ReadCount of the prior Get-Content cmdlet.
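To see the effect for yourself, you can time a few -ReadCount values with Measure-Command (an illustrative comparison, assuming the Test.Csv file from the example above; exact timings depend on your system):

```powershell
# Time the same split with different batch sizes (illustrative only)
Measure-Command { Get-Content .\Test.Csv -ReadCount 1    | Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000 }
Measure-Command { Get-Content .\Test.Csv -ReadCount 1000 | Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000 }
```

Larger -ReadCount values pass more lines per pipeline call, which reduces per-call overhead at the cost of memory.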
Get-Process |Out-String -Stream |Split-Content -Path .\Process.Txt -HeadSize 2 -DataSize 20
This will write chunks of 20 processes to the .\Process<PartNr>.Txt files, each preceded by the standard two-line header format:
NPM(K) PM(M) WS(M) CPU(s) Id SI ProcessName
------ ----- ----- ------ -- -- -----------
... # 20 rows following
Upvotes: 0
Reputation: 54821
This is my attempt, it got messy :-P It will load the whole file into memory while splitting it, but this is pure text. It should take less memory than imported objects, but still about the size of the file.
$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1
$start = 1
while ($start -le $lines - 1) {
    #Set minimum $end value (last line)
    if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }
    #Value to compare. ColA is the first column in my file = [0]. ColB is the second column = [1]
    $avalue = $content[$end].split(",")[0]
    #If not the last line in the file
    if ($end -ne $lines - 1) {
        #Increase $end while the next line's ColA is the same (stop at the last line)
        while ($end -lt $lines - 1 -and $content[$end + 1].split(",")[0] -eq $avalue) { $end++ }
    }
    #Create new csv-part
    $filename = $file.FullName.Replace($file.BaseName, ($file.BaseName + ".part$filepart"))
    @($csvheader, $content[$start..$end]) | Set-Content $filename
    #Fix counters
    $filepart++
    $start = $end + 1
}
file.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10
Results (I used $minlines = 5):
file.part1.csv:
ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
file.part2.csv:
ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
file.part3.csv:
ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10
Upvotes: 2
Reputation: 68243
Not tested. This assumes ColumnA is the first column and it's common comma-delimited data. You'll need to adjust the line that creates the regex to suit your data.
$count = 0
$header = Get-Content file.csv -TotalCount 1
Get-Content file.csv -ReadCount 1000 |
    foreach {
        #Add tail entries from the last batch to the beginning of this batch
        $newbatch = $tail + $_
        #Create regex to match the last entry in this batch; include the trailing
        #delimiter so that e.g. "A1" does not also match "A10"
        $regex = '^' + [regex]::Escape(($newbatch[-1].split(',')[0] + ','))
        #Extract everything that doesn't match the last entry to a new file.
        #Add the header if this is not the first file (the first batch already contains it).
        if ($count)
        {
            $header |
                Set-Content "c:\somedir\filepart_$count"
        }
        $newbatch -notmatch $regex |
            Add-Content "c:\somedir\filepart_$count"
        #Extract tail entries to add to the next batch
        $tail = @($newbatch -match $regex)
        #Increment the file counter
        $count++
    }
#Write the remaining tail (the last group in the file) to a final part
if ($tail) {
    $header | Set-Content "c:\somedir\filepart_$count"
    $tail | Add-Content "c:\somedir\filepart_$count"
}
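As an illustration of adjusting the regex line: if the grouping column were the second comma-delimited field rather than the first (a hypothetical variation, not part of the original answer), it could look like this:

```powershell
# Skip the first field, then match the second field of the batch's last entry,
# followed by a delimiter or end of line so prefixes like "A1"/"A10" don't collide
$regex = '^[^,]*,' + [regex]::Escape(($newbatch[-1].split(',')[1])) + '(,|$)'
```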
Upvotes: 5
Reputation: 28144
This requires PowerShell v3 (due to -append on Export-CSV).
Also, I'm assuming that you have column headers and the first column is named col1. Adjust as necessary.
Import-Csv MYFILE.csv | ForEach-Object { $_ | Export-Csv -NoTypeInformation -NoClobber -Append ($_.col1 + ".csv") }
This will create one file for each distinct value in the first column, with that value as the file name.
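If the values in the first column can contain characters that are not valid in file names (slashes, colons, etc.), a variation on the same one-liner (a hedged sketch, not part of the original answer) sanitizes the value before using it as a path:

```powershell
Import-Csv MYFILE.csv | ForEach-Object {
    # Replace characters that are invalid in Windows file names with '_' (illustrative)
    $name = ($_.col1 -replace '[\\/:*?"<>|]', '_') + '.csv'
    $_ | Export-Csv -NoTypeInformation -Append $name
}
```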
Upvotes: 0