mojoa

Reputation: 123

PowerShell Large File Creation to Size with Specified Input Data

I am trying to determine what PowerShell command would be equivalent to the following Linux command: creating a large file in a reasonable time, with an exact size, AND populated with the given text input.

Given:

$ cat line.txt
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ

$ time yes `cat line.txt` | head -c 10GB > file.txt  # create large file
real    0m59.741s

$ ls -lt file.txt
-rw-r--r--+ 1 k None 10000000000 Feb  2 16:28 file.txt

$ head -3 file.txt
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ

What would be the most efficient, compact PowerShell command that, like the Linux command above, would let me specify the size and the text content and create the file? Thanks! (My original question was automatically closed for some reason.)

Upvotes: 3

Views: 2244

Answers (3)

mklement0

Reputation: 437988

There is no direct PowerShell equivalent of your command.

In fact, with files of this size your best bet is to avoid PowerShell's own cmdlets and pipeline and to make direct use of .NET types instead:

& {
  param($outFile, $size, $content)

  # Append a newline to the input string.
  $line = $content + "`n"

  # Calculate how often the line must be repeated (including trailing newline)
  # to reach the target size.
  [long] $remainder = 0
  $iterations = [math]::DivRem($size, $line.Length, [ref] $remainder)

  # Create the output file.
  $outFileInfo = New-Item -Force $outFile
  $fs = [System.IO.StreamWriter] $outFileInfo.FullName

  # Fill it with duplicates of the line.
  foreach ($i in 1..$iterations) {
    $fs.Write($line)
  }

  # If a partial line is needed to reach the exact target size, write it now.
  if ($remainder) {
    $fs.Write($line.Substring(0, $remainder))
  }

  $fs.Close()
  
} file.txt 1e10 (Get-Content line.txt)

Note: 1e10 uses PowerShell's support for scientific notation as shorthand for 10000000000 (10,000,000,000, i.e., [math]::Pow(10, 10)). PowerShell also has built-in support for byte-multiplier suffixes - kb, mb, gb and tb - but they are binary multipliers, so 10gb is equivalent to 10,737,418,240 (10 * [math]::Pow(1024, 3)), not decimal 10,000,000,000.
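The difference between the two notations can be verified interactively at the prompt (a quick illustration, not part of the command itself):

```powershell
1e10          # scientific notation, parsed as a [double]: 10000000000
[long] 1e10   # cast to an integer type when an exact size is needed

10gb          # binary multiplier: 10 * [math]::Pow(1024, 3) = 10737418240
10gb - 1e10   # difference between binary "10 GB" and decimal 10 GB: 737418240
```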

Note:

  • The size passed (1e10 in this case) is a character count, not a byte count. Given that .NET's file I/O APIs use BOM-less UTF-8 encoding by default, the two counts will only be equal if you restrict the input string to characters in the ASCII range (code points 0x0 - 0x7f).

  • The last instance of the input string may be cut off (without a trailing newline) if the total character count isn't an exact multiple of the input string length + 1 (for the newline).

  • The performance of this code can be improved by up to 20%, through a combination of writing raw bytes and output buffering, as shown in zett42's helpful answer.
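Regarding the character-count vs. byte-count caveat: a quick way to check a given input string is to compare its .Length with its UTF-8 byte count (a small sketch; the sample strings are arbitrary):

```powershell
$utf8 = [System.Text.UTF8Encoding]::new($false)  # BOM-less UTF-8

$ascii    = 'ABC'    # ASCII-only: 1 byte per character
$nonAscii = 'ABCé'   # 'é' (U+00E9) encodes as 2 bytes in UTF-8

$ascii.Length                    # -> 3
$utf8.GetByteCount($ascii)       # -> 3  (counts match)

$nonAscii.Length                 # -> 4
$utf8.GetByteCount($nonAscii)    # -> 5  (counts diverge)
```

If the two counts match for your input string, the character count passed to the script equals the resulting file size in bytes.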

The above performs reasonably well by PowerShell standards.

In general, PowerShell's object-oriented nature will never match the speed of the raw byte handling provided by native Unix utilities / shells.

It wouldn't be hard to turn the code above into a reusable function; in a nutshell, replace & { ... } with something like function New-FileOfSize { ... } and call New-FileOfSize file.txt 1gb (Get-Content line.txt) - see the conceptual about_Functions help topic, and about_Functions_Advanced for how to make the function more sophisticated.

Upvotes: 3

zett42

Reputation: 27766

A slightly optimized version of mklement0's script.

  • Encode the string only once at the beginning.
  • Use System.IO.FileStream instead of System.IO.StreamWriter to write raw bytes instead of a string which has to be encoded first.
  • Use a larger buffer than StreamWriter's rather small default. A size of 1 MiB seems to be the sweet spot on my machine; a 2 MiB buffer is already slower, probably due to worse caching behaviour. It may vary on your machine.
  • Unrelated to performance: a line feed character is no longer added to the input string $content. If needed, the user can include it in the argument. To make this possible, I have added the -Raw argument to the Get-Content call.

& {
    param($outFile, $size, $content)
  
    # Encode the input string as UTF-8
    $encoding = [Text.UTF8Encoding]::new()
    $contentBytes = $encoding.GetBytes( $content )
  
    # Calculate how often the content must be repeated to reach the target size.
    [long] $remainder = 0
    $iterations = [math]::DivRem($size, $contentBytes.Length, [ref] $remainder)
  
    # Convert the PowerShell path to a full path for use by .NET API.
    # .NET can't use a relative PowerShell path, as its current directory may differ from
    # PowerShell's current directory.
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath( $outFile )

    # Create a file stream with a large buffer size for improved performance.
    $bufferSize = 1MB
    $stream = [IO.FileStream]::new( $fullPath, [IO.FileMode]::Create, [IO.FileAccess]::Write, 
                                    [IO.FileShare]::Read, $bufferSize )

    try {
        # Fill it with duplicates of the content.
        foreach ($i in 1..$iterations) {
            $stream.Write($contentBytes, 0, $contentBytes.Length)
        }
      
        # If a sub string of the content is needed to reach the exact target size, write it now. 
        # Note this may create an invalid UTF-8 code point at the end, depending on
        # the input. Basic ASCII is no problem.
        if ($remainder) {
            $stream.Write($contentBytes, 0, $remainder)
        } 
    }
    finally {
        # Close the stream even when an exception has been thrown.
        $stream.Close()
    }    
} file.txt 1gb (Get-Content -raw line.txt) 

For testing, the script was used to create a 1 GB file with the OP's test content (99 characters + LF). For each test, the average MiB/s over 100 runs was calculated:

$duration = (1..100 | %{ (Measure-Command { .\Test.ps1 }).TotalSeconds } | Measure-Object -Average).Average
"$(1024 / $duration) MiB/s"

Test results:

Script               Buffer size   MiB/s
mklement0's script   default       438
optimized script     4 KiB         434
optimized script     16 KiB        483
optimized script     64 KiB        521
optimized script     256 KiB       524
optimized script     1 MiB         528
optimized script     2 MiB         526

So in the best case we get a ~20% increase in performance. Not spectacular, but still noticeable.

The values look quite good when compared with the SSD performance measured by winsat:

> winsat disk -seq -write -drive x
Disk  Sequential 64.0 Write                  496.03 MB/s

Upvotes: 1

postanote

Reputation: 16096

Continuing from my comment.

There is no built-in command to do this; you have to code it.

Building on the info I pointed to via the search: in PowerShell proper, a quick take on your use case would be an approach like this.

Function New-EmptyFile
{
<#
.Synopsis
    Create a new empty file 
.DESCRIPTION
    This function creates a new file of the given size
.EXAMPLE
    New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb

.EXAMPLE
    nef 'D:\Temp\nef.txt' 10mb

.NOTES
    You can modify data in the file this way
    (Get-Content -path 'D:\Temp\nef.txt' -Raw) -replace '\.*','white' | 
    Set-Content -Path 'D:\Temp\nef.txt'    
#>

    [cmdletbinding(SupportsShouldProcess)]
    [Alias('nef')]
    param
    (
        [string]$FilePath,
        [double]$Size
    )
 
    $file = [System.IO.File]::Create($FilePath)
    $file.SetLength($Size)
    $file.Close()

    Get-Item $file.Name
}

You could take this:

(Get-Content -path 'D:\Temp\nef.txt' -Raw) -replace '\.*','white' | 
Set-Content -Path 'D:\Temp\nef.txt'

... and make it part of the function. Something like this:

Function New-EmptyFile
{
<#
.Synopsis
    Create a new empty file 
.DESCRIPTION
    This function creates a new file of the given size
.EXAMPLE
    New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb

.EXAMPLE
    nef 'D:\Temp\nef.txt' 10mb

.NOTES
    Other notes here
 
#>

    [cmdletbinding(SupportsShouldProcess)]
    [Alias('nef')]
    param
    (
        [string]$FilePath,
        [double]$Size,
        [string]$FileData
    )
 
    $file = [System.IO.File]::Create($FilePath)
    $file.SetLength($Size)
    $file.Close()

    Get-Item $file.Name

    If ($FileData)
    {
        (Get-Content -Path (Get-Item $file.Name).FullName -Raw) -replace '\.*',$FileData | 
        Set-Content -Path (Get-Item $file.Name).FullName   
    }
}

New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb -FileData 'The quick brown fox.'

However, when dealing with large files, getting good performance specifically means using the .NET APIs directly.

None of the above is an exact replacement of what you posted, so you will need to tweak as needed.

See this write-up

Reading large text files with PowerShell

Upvotes: 0
