jgaw

Reputation: 1724

PowerShell random shuffle/split large text file

Is there a fast way in PowerShell to randomly shuffle a text file with 15 million rows and split it 15%-85%?

Many sources show how to do it using Get-Content, but Get-Content and Get-Random are slow for large files:

Get-Content "largeFile.txt" | Sort-Object { Get-Random } | Out-File "shuffled.txt"

I was looking for solutions using StreamReader and StreamWriter, but I'm not sure if it's possible. Linux bash seems to do this extremely fast for my file of 15 million rows: How can I shuffle the lines of a text file on the Unix command line or in a shell script?

Upvotes: 2

Views: 3303

Answers (2)

jgaw

Reputation: 1724

I wanted to use a stream reader/writer to keep memory usage down, since some of these files are over 300 MB. I could not avoid memory use completely, but instead of loading the file into memory, I create a sorted array of random line numbers between 0 and the total line count minus one. The array indicates which rows go to the sample (control) file.

Create Stream Reader for Data

$reader = New-Object -TypeName System.IO.StreamReader("data.txt");

Create Stream Writer for Test Population

$writer_stream = New-Object -TypeName System.IO.FileStream(
    ("test_population.txt"),
    [System.IO.FileMode]::Create,
    [System.IO.FileAccess]::Write);
$writer= New-Object -TypeName System.IO.StreamWriter(
    $writer_stream,
    [System.Text.Encoding]::ASCII);

Create Stream Writer for Control Group

$writer_stream_control = New-Object -TypeName System.IO.FileStream(
    ("control.txt"),
    [System.IO.FileMode]::Create,
    [System.IO.FileAccess]::Write);
$writer_control= New-Object -TypeName System.IO.StreamWriter(
    $writer_stream_control,
    [System.Text.Encoding]::ASCII);

Determine the control size. $line_count must match the number of lines in the file (10 million in this example); the random line numbers are drawn from 0 to $line_count - 1.

$line_count = 10000000
$control_percent = 0.15
$control_size = [math]::round($control_percent*$line_count)
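The $line_count above is hard-coded. If the count is not known in advance, one way to get it without loading the file into memory is to stream through it once. This is only a sketch: Get-LineCount is a hypothetical helper name, and it assumes .NET 4 or later, where [System.IO.File]::ReadLines is available.

```powershell
# Hypothetical helper: counts lines by streaming, so the file is never held in memory.
function Get-LineCount {
    param([string]$Path)

    # .NET resolves relative paths against the process working directory,
    # not the current PowerShell location, so resolve the path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $count = 0
    # ReadLines enumerates lazily, one line at a time.
    foreach ($line in [System.IO.File]::ReadLines($fullPath)) {
        $count++
    }
    return $count
}
```

With that in place, the hard-coded value could be replaced by $line_count = Get-LineCount "data.txt", at the cost of one extra pass over the file.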

Create a sorted index of random line numbers that determines which rows go to the sample file. The sort at the end is required, because the read loop walks the index in ascending order.

$idx = Get-Random -Count $control_size -InputObject (0..($line_count-1)) | Sort-Object -Unique

Denote $i as the current line number; $idx[$j] is the next row that should go to the sample file.

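$i = 0; $j = 0
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()  # read the next line
    if ($j -lt $idx.Count -and $idx[$j] -eq $i) {
        $writer_control.WriteLine($line)  # line number is in the index: control group
        $j++
    }
    else {
        $writer.WriteLine($line)  # every other line: test population
    }
    $i++  # advance the line counter on every iteration
}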

$reader.Close();
$reader.Dispose();

$writer.Flush();
$writer.Close();
$writer.Dispose();

$writer_control.Flush();
$writer_control.Close();
$writer_control.Dispose();
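A possible variation that avoids building the index array at all is selection sampling: each line is sent to the control file with probability (control lines still needed) / (lines still unread), which yields exactly the requested number of control lines in one pass, provided the line count is correct. This is a different technique from the index approach above, offered as a sketch only. The function name and parameters are hypothetical, and the writers use the default UTF-8 encoding rather than ASCII. Full paths are safest, since .NET resolves relative paths against the process working directory.

```powershell
# Hypothetical helper: one-pass split using selection sampling.
# Assumes $LineCount equals the real number of lines in $Path.
function Split-FileBySelectionSampling {
    param(
        [string]$Path,
        [string]$ControlPath,
        [string]$TestPath,
        [int]$LineCount,
        [int]$ControlSize
    )

    $rng     = New-Object System.Random
    $reader  = New-Object System.IO.StreamReader($Path)
    $control = New-Object System.IO.StreamWriter($ControlPath)
    $test    = New-Object System.IO.StreamWriter($TestPath)
    try {
        $remaining = $LineCount    # lines not yet read
        $needed    = $ControlSize  # control lines still to be picked
        while ($null -ne ($line = $reader.ReadLine())) {
            # Pick this line with probability $needed / $remaining.
            if ($rng.NextDouble() * $remaining -lt $needed) {
                $control.WriteLine($line)
                $needed--
            }
            else {
                $test.WriteLine($line)
            }
            $remaining--
        }
    }
    finally {
        # Dispose flushes and closes the underlying files.
        $reader.Dispose()
        $control.Dispose()
        $test.Dispose()
    }
}
```

Like the index approach, this keeps the lines in their original order within each output file, so it splits the data randomly but does not shuffle it.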

Upvotes: 0

mjolinor

Reputation: 68273

Not sure if this will be sufficiently randomized/shuffled, but it should be faster:

$Idxs = 0..999
Get-Content "largeFile.txt" -ReadCount 1000 | 
foreach {
 $sample = Get-Random -InputObject $Idxs  -Count 150
 $_[$sample] |
 Add-Content 'shuffled.txt'
 }

Upvotes: 1
