Reputation: 1724
Is there a fast implementation in PowerShell to randomly shuffle and split a text file with 15 million rows using a 15%-85% split?
Many sources mention how to do it using Get-Content, but Get-Content and Get-Random are slow for large files:
Get-Content "largeFile.txt" | Sort-Object{Get-Random}| Out-file "shuffled.txt"
I was looking for solutions using StreamReader and StreamWriter, but I'm not sure if it's possible. Linux bash seems to do this extremely fast for my file of 15 million rows; see: How can I shuffle the lines of a text file on the Unix command line or in a shell script?
Upvotes: 2
Views: 3303
Reputation: 1724
I used StreamReader/StreamWriter to keep memory usage down, since some of these files are over 300 MB. I could not avoid using memory entirely, but instead of loading the file into memory, I create an array of random line numbers between 0 and the total line count minus one. The array indicates which rows go to the sample file.
Create Stream Reader for Data
$reader = New-Object -TypeName System.IO.StreamReader("data.txt");
Create Stream Writer for Test Population
$writer_stream = New-Object -TypeName System.IO.FileStream(
("test_population.txt"),
[System.IO.FileMode]::Create,
[System.IO.FileAccess]::Write);
$writer= New-Object -TypeName System.IO.StreamWriter(
$writer_stream,
[System.Text.Encoding]::ASCII);
Create Stream Writer for Control Group
$writer_stream_control = New-Object -TypeName System.IO.FileStream(
("control.txt"),
[System.IO.FileMode]::Create,
[System.IO.FileAccess]::Write);
$writer_control= New-Object -TypeName System.IO.StreamWriter(
$writer_stream_control,
[System.Text.Encoding]::ASCII);
Determine the control size and randomly choose numbers between 0 and the total number of rows in the file.
$line_count = 10000000
$control_percent = 0.15
$control_size = [math]::round($control_percent*$line_count)
Create an index of random line numbers to determine which rows should go to the sample file. Make sure to pipe through Sort-Object at the end, because the loop below walks the index in ascending line order.
$idx = Get-Random -Count $control_size -InputObject (0..($line_count-1)) | Sort-Object -Unique
Denote $i as the current line number; use $idx[$j] as the next row that should go to the sample file.
$i = 0; $j = 0
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine() # read the next line
    if ($idx[$j] -eq $i) {
        # this line number was drawn: write it to the control file
        $writer_control.WriteLine($line)
        $j++
    }
    else { $writer.WriteLine($line) }
    $i++ # advance the line counter once per line read
}
$reader.Close();
$reader.Dispose();
$writer.Flush();
$writer.Close();
$writer.Dispose();
$writer_control.Flush();
$writer_control.Close();
$writer_control.Dispose();
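The index array above still holds one number per line of the file while Get-Random runs. As a rough sketch of an alternative (not the method used above), selection sampling picks exactly the requested number of lines in a streaming pass without building any index: each line is taken with probability (lines still needed) / (lines still remaining). The function name Split-LinesRandomly and its parameters are hypothetical, the writers use the default StreamWriter encoding rather than ASCII, and, like the code above, it splits the file but keeps the original line order inside each output file.

```powershell
# Hypothetical helper: exact 15%/85% style split using selection sampling.
function Split-LinesRandomly {
    param(
        [Parameter(Mandatory)] [string]$Path,
        [Parameter(Mandatory)] [string]$ControlPath,
        [Parameter(Mandatory)] [string]$TestPath,
        [double]$ControlPercent = 0.15
    )

    # .NET resolves relative paths against the process directory, not the
    # current PowerShell location, so resolve all three paths first.
    $resolve = { param($p) $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($p) }
    $in  = & $resolve $Path
    $ctl = & $resolve $ControlPath
    $tst = & $resolve $TestPath

    # Pass 1: count the lines so the control size is exact.
    $total = 0
    $reader = New-Object System.IO.StreamReader($in)
    try { while ($null -ne $reader.ReadLine()) { $total++ } }
    finally { $reader.Dispose() }

    $needed    = [int][math]::Round($ControlPercent * $total)
    $remaining = $total
    $rng       = New-Object System.Random

    # Pass 2: route each line to one of the two output files.
    $reader = $null; $control = $null; $test = $null
    try {
        $reader  = New-Object System.IO.StreamReader($in)
        $control = New-Object System.IO.StreamWriter($ctl)
        $test    = New-Object System.IO.StreamWriter($tst)
        while ($null -ne ($line = $reader.ReadLine())) {
            # Take the line with probability needed / remaining.
            if ($rng.NextDouble() * $remaining -lt $needed) {
                $control.WriteLine($line)
                $needed--
            }
            else { $test.WriteLine($line) }
            $remaining--
        }
    }
    finally {
        # Dispose flushes and closes each stream that was opened.
        foreach ($s in $reader, $control, $test) { if ($s) { $s.Dispose() } }
    }
}
```

A call might look like: Split-LinesRandomly -Path data.txt -ControlPath control.txt -TestPath test_population.txt. The trade-off is a second read of the file in exchange for not holding millions of indexes in memory.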
Upvotes: 0
Reputation: 68273
Not sure if this will be sufficiently randomized/shuffled, but it should be faster:
$Idxs = 0..999
Get-Content "largeFile.txt" -ReadCount 1000 |
foreach {
$sample = Get-Random -InputObject $Idxs -Count 150
$_[$sample] |
Add-Content 'shuffled.txt'
}
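The snippet above writes only the 15% sample. A possible variant, sketched here under my own assumptions, also writes the remaining 85% and scales the sample size to the actual batch length so a short final batch is handled. The function name Split-ByBatch and its parameters are hypothetical. Lines are shuffled only within each batch, not across the whole file, and Add-Content appends, so the output files should not exist beforehand.

```powershell
# Hypothetical helper: batch-wise split into a sample file and a remainder file.
function Split-ByBatch {
    param(
        [Parameter(Mandatory)] [string]$Path,
        [Parameter(Mandatory)] [string]$SamplePath,
        [Parameter(Mandatory)] [string]$RestPath,
        [double]$Fraction = 0.15,
        [int]$BatchSize = 1000
    )

    Get-Content -Path $Path -ReadCount $BatchSize | ForEach-Object {
        $batch = @($_)                       # one array of up to $BatchSize lines
        $n     = $batch.Count
        $take  = [int][math]::Round($Fraction * $n)

        # A random permutation of the batch's indexes.
        $order = @(Get-Random -InputObject @(0..($n - 1)) -Count $n)

        # The first $take shuffled indexes form the sample, the others the remainder.
        $sample = @($order | Select-Object -First $take | ForEach-Object { $batch[$_] })
        $rest   = @($order | Select-Object -Skip  $take | ForEach-Object { $batch[$_] })

        $sample | Add-Content -Path $SamplePath
        $rest   | Add-Content -Path $RestPath
    }
}
```

For example: Split-ByBatch -Path largeFile.txt -SamplePath sample.txt -RestPath rest.txt. A larger -BatchSize mixes lines over a wider window at the cost of more memory per batch.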
Upvotes: 1