Reputation: 123
I am trying to determine which PowerShell command is equivalent to the following Linux command: creating a large file of an exact size, in reasonable time, populated with given text input.
Given:
$ cat line.txt
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
$ time yes `cat line.txt` | head -c 10GB > file.txt # create large file
real 0m59.741s
$ ls -lt file.txt
-rw-r--r--+ 1 k None 10000000000 Feb 2 16:28 file.txt
$ head -3 file.txt
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
What would be the most efficient, compact PowerShell command that lets me specify the size and text content and create the file like the Linux command above? Thanks! (My original question was automatically closed for some reason.)
Upvotes: 3
Views: 2244
Reputation: 437988
There is no direct PowerShell equivalent of your command.
In fact, with files of this size your best bet is to avoid PowerShell's own cmdlets and pipeline and to make direct use of .NET types instead:
& {
    param($outFile, $size, $content)
    # Append a newline to the input string, if needed.
    $line = $content + "`n"
    # Calculate how often the line (including its trailing newline) must be
    # repeated to reach the target size.
    [long] $remainder = 0
    $iterations = [math]::DivRem($size, $line.Length, [ref] $remainder)
    # Create the output file.
    $outFileInfo = New-Item -Force $outFile
    $fs = [System.IO.StreamWriter] $outFileInfo.FullName
    # Fill it with duplicates of the line.
    foreach ($i in 1..$iterations) {
        $fs.Write($line)
    }
    # If a partial line is needed to reach the exact target size, write it now.
    if ($remainder) {
        $fs.Write($line.Substring(0, $remainder))
    }
    $fs.Close()
} file.txt 1e10 (Get-Content line.txt)
Note: 1e10 uses PowerShell's support for scientific notation as shorthand for 10000000000 (10,000,000,000, i.e. [Math]::Pow(10, 10)). Note that PowerShell also has built-in support for byte-multiplier suffixes - kb, mb, gb and tb - but they are binary multipliers, so that 10gb is equivalent to 10,737,418,240 (10 * [math]::Pow(1024, 3)), not decimal 10,000,000,000.
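A quick sanity check of the two literals at a PowerShell prompt (expected values shown as comments):

```powershell
[long] 1e10         # 10000000000  (decimal 10^10; 1e10 itself parses as [double])
10gb                # 10737418240  (binary: 10 * 1024 * 1024 * 1024)
10gb - [long] 1e10  # 737418240    (the ~7.4% difference between the two)
```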
Note:
The size passed (1e10 in this case) is a character count, not a byte count. Given that .NET's file I/O APIs use BOM-less UTF-8 encoding by default, the two counts are only equal if you restrict the input string used to fill the file to characters in the ASCII range (code points 0x0 - 0x7f).
The last instance of the input string may be cut off (without a trailing newline) if the total character count isn't an exact multiple of the input string length + 1 (for the newline).
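The character-count vs. byte-count caveat can be checked directly; 'é' below is just an arbitrary non-ASCII example character:

```powershell
$s = 'café'
$s.Length                                  # 4 characters
[Text.Encoding]::UTF8.GetByteCount($s)     # 5 bytes - the accented 'é' encodes as 2 bytes
[Text.Encoding]::UTF8.GetByteCount('abc')  # 3 bytes - pure ASCII, counts match
```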
The performance of this code can be improved by up to 20% through a combination of writing raw bytes and output buffering, as shown in zett42's helpful answer.
The above performs reasonably well by PowerShell standards.
In general, PowerShell's object-oriented nature will never match the speed of the raw byte handling provided by native Unix utilities / shells.
It wouldn't be hard to turn the code above into a reusable function; in
a nutshell, replace & { ... }
with something like function New-FileOfSize { ... }
and call New-FileOfSize file.txt 1gb (Get-Content line.txt)
- see the conceptual about_Functions help topic, and about_Functions_Advanced for how to make the function more sophisticated.
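A minimal sketch of that transformation (keeping the original parameter handling; the function name is just the suggestion from above):

```powershell
function New-FileOfSize {
    param($outFile, $size, $content)
    # ... same body as the & { ... } script block above ...
}

# Positional invocation, analogous to the original call:
New-FileOfSize file.txt 1gb (Get-Content line.txt)
```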
Upvotes: 3
Reputation: 27766
A slightly optimized version of mklement0's script.
Changes compared to the original:
- Uses System.IO.FileStream instead of System.IO.StreamWriter to write raw bytes instead of a string, which would have to be encoded first.
- Uses a much larger output buffer than the default of StreamWriter, which is rather small. A size of 1 MiB seems to be in the sweet spot on my machine; a 2 MiB buffer is already slower, probably due to worse caching behaviour. It may vary on your machine.
- Doesn't append a newline to $content. If needed, it can be added to the argument by the user. To make this possible I have added the -Raw argument to the Get-Content call.
& {
    param($outFile, $size, $content)
    # Encode the input string as UTF-8.
    $encoding = [Text.UTF8Encoding]::new()
    $contentBytes = $encoding.GetBytes( $content )
    # Calculate how often the content must be repeated to reach the target size.
    [long] $remainder = 0
    $iterations = [math]::DivRem($size, $contentBytes.Length, [ref] $remainder)
    # Convert the PowerShell path to a full path for use by the .NET API.
    # .NET can't use a relative PowerShell path, as its current directory may
    # differ from PowerShell's current directory.
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath( $outFile )
    # Create a file stream with a large buffer size for improved performance.
    $bufferSize = 1MB
    $stream = [IO.FileStream]::new( $fullPath, [IO.FileMode]::Create, [IO.FileAccess]::Write,
                                    [IO.FileShare]::Read, $bufferSize )
    try {
        # Fill it with duplicates of the content.
        foreach ($i in 1..$iterations) {
            $stream.Write($contentBytes, 0, $contentBytes.Length)
        }
        # If a substring of the content is needed to reach the exact target size,
        # write it now. Note this may split a multi-byte UTF-8 sequence at the end,
        # depending on the input. Basic ASCII is no problem.
        if ($remainder) {
            $stream.Write($contentBytes, 0, $remainder)
        }
    }
    finally {
        # Close the stream even when an exception has been thrown.
        $stream.Close()
    }
} file.txt 1gb (Get-Content -Raw line.txt)
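Assuming the call above completed, the exact resulting size can be verified like this (1gb is PowerShell's binary multiplier, i.e. 1 GiB):

```powershell
(Get-Item file.txt).Length          # 1073741824
(Get-Item file.txt).Length -eq 1gb  # True
```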
For testing, the script was used to create a 1 GiB file with the OP's test content (99 characters + LF). For each test, the average MiB/s over 100 runs was calculated:
$duration = (1..100 | %{ (Measure-Command { .\Test.ps1 }).TotalSeconds } | Measure-Object -Average).Average
"$(1024 / $duration) MiB/s"
Test results:
Script             | Buffer size | MiB/s
-------------------|-------------|------
mklement0's script | default     | 438
optimized script   | 4 KiB       | 434
optimized script   | 16 KiB      | 483
optimized script   | 64 KiB      | 521
optimized script   | 256 KiB     | 524
optimized script   | 1 MiB       | 528
optimized script   | 2 MiB       | 526
So in the best case we get a ~20% increase in performance. Not spectacular, but still noticeable.
The values look quite good when compared with the SSD write performance measured by winsat:
> winsat disk -seq -write -drive x
Disk Sequential 64.0 Write 496.03 MB/s
Upvotes: 1
Reputation: 16096
Continuing from my comment.
There is no built-in command to do this; you have to code it.
Based on the info I pointed to via the search, a quick take on your use case in PowerShell proper would be an approach like this:
Function New-EmptyFile
{
    <#
    .SYNOPSIS
        Create a new empty file
    .DESCRIPTION
        This function creates a new file of the given size
    .EXAMPLE
        New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb
    .EXAMPLE
        nef 'D:\Temp\nef.txt' 10mb
    .NOTES
        You can modify data in the file this way (the new file is NUL-filled,
        so replace runs of NUL characters):
        (Get-Content -Path 'D:\Temp\nef.txt' -Raw) -replace '\0+','white' |
            Set-Content -Path 'D:\Temp\nef.txt'
    #>
    [CmdletBinding(SupportsShouldProcess)]
    [Alias('nef')]
    param
    (
        [string]$FilePath,
        [double]$Size
    )

    $file = [System.IO.File]::Create($FilePath)
    $file.SetLength($Size)
    $file.Close()
    Get-Item $file.Name
}
You could take this:
(Get-Content -Path 'D:\Temp\nef.txt' -Raw) -replace '\0+','white' |
    Set-Content -Path 'D:\Temp\nef.txt'
... and make it part of the function. Something like this:
Function New-EmptyFile
{
    <#
    .SYNOPSIS
        Create a new empty file
    .DESCRIPTION
        This function creates a new file of the given size
    .EXAMPLE
        New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb
    .EXAMPLE
        nef 'D:\Temp\nef.txt' 10mb
    .NOTES
        Other notes here
    #>
    [CmdletBinding(SupportsShouldProcess)]
    [Alias('nef')]
    param
    (
        [string]$FilePath,
        [double]$Size,
        [string]$FileData
    )

    $file = [System.IO.File]::Create($FilePath)
    $file.SetLength($Size)
    $file.Close()
    Get-Item $file.Name

    If ($FileData)
    {
        # The new file is NUL-filled; replace the NUL padding with the given data.
        (Get-Content -Path (Get-Item $file.Name).FullName -Raw) -replace '\0+',$FileData |
            Set-Content -Path (Get-Item $file.Name).FullName
    }
}
New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb -FileData 'The quick brown fox.'
However, when dealing with large files, getting good performance specifically means using the .NET namespaces directly.
None of the above is an exact replacement for what you posted, so you will need to tweak as needed.
See this write-up
Upvotes: 0