sourcenouveau

Reputation: 30504

Using PowerShell to write a file in UTF-8 without the BOM

Out-File seems to force the BOM when using UTF-8:

$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "UTF8" $MyPath

How can I write a file in UTF-8 with no BOM using PowerShell?

Update 2021

PowerShell has changed a bit since I wrote this question 10 years ago. Check the multiple answers below; they have a lot of good information!

Upvotes: 397

Views: 460867

Answers (22)

Verity Freedom

Reputation: 13

I ran into this issue myself. I am the author of a program (more precisely, a mod for a piece of free software, so I can't just rewrite its core) that is sensitive to the BOM. I ran some tests and asked a lot of questions. I needed a cmd script that would process one specific text file after it had been encoded in UTF-8, so that my program would work correctly with Cyrillic.

The best answer that I myself got is to use something like this:

powershell -Command "(gc '%CD%\myfile.txt') "^
...
"| Out-File -encoding utf8 '%CD%\myfile.txt'"
powershell "(get-content %CD%\myfile.txt -Encoding Byte) | select -skip 3 | set-content %CD%\myfile.txt -Encoding Byte"

By no means do I claim authorship of the method, thanks a lot to js2010 for the hint.
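Skipping the first three bytes unconditionally would damage a file that does not start with a BOM. Below is a minimal sketch that checks first. It uses only .NET calls, so it should behave the same in Windows PowerShell and PowerShell 7; the function name Remove-Utf8Bom and its parameter are illustrative and not part of any module:

```powershell
# Hypothetical helper: removes a UTF-8 BOM (EF BB BF) only if the file starts with one.
function Remove-Utf8Bom {
    param([Parameter(Mandatory)] [string] $LiteralPath)

    # Resolve to a full path, because .NET's working directory can differ from PowerShell's.
    $fullPath = Convert-Path -LiteralPath $LiteralPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)

    if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        # Copy everything after the three BOM bytes and write it back.
        $rest = New-Object byte[] ($bytes.Length - 3)
        [System.Array]::Copy($bytes, 3, $rest, 0, $rest.Length)
        [System.IO.File]::WriteAllBytes($fullPath, $rest)
        return $true    # a BOM was found and removed
    }
    return $false       # no BOM; file left untouched
}
```

Calling Remove-Utf8Bom -LiteralPath .\myfile.txt a second time is then a no-op rather than a second truncation.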

And I think this is good enough. With the BOM the program wasn't starting at all (I checked), and now it starts in a directory with a Latin name. But for Cyrillic this didn't work, I think because the program itself doesn't support the UTF-8 representation of Cyrillic.

The only thing that truly solved my problem was:

chcp 1251
powershell -Command "(gc '%CD%\myfile.txt') "^
...
"| Out-File -encoding default '%CD%\myfile.txt'"

By setting chcp 1251 the program finally understood Cyrillic (the file became corrupted for Windows Notepad for some reason, but perfectly readable for my program); default in this situation returns the previously set value. We have expanded the cmd ASCII to ANSI and removed the BOM. If we need a set of additional characters other than Cyrillic, we can use chcp 1252 or any other code page.

I hope this solves your problem.

Upvotes: 0

mklement0

Reputation: 436983

Note: This answer applies to Windows PowerShell (the legacy, ships-with-Windows, Windows-only edition of PowerShell whose latest and last version is 5.1); by contrast, in the cross-platform PowerShell (Core) 7 edition, UTF-8 without BOM is the default encoding, across all cmdlets.

  • In other words: If you're using PowerShell (Core) 7, i.e. version v7.x, you get BOM-less UTF-8 files by default (which you can also explicitly request with -Encoding utf8 / -Encoding utf8NoBOM, whereas you get with-BOM encoding with -Encoding utf8BOM).

  • If you're running Windows 10 or above and you're willing to switch to BOM-less UTF-8 encoding system-wide - which has far-reaching consequences, however - even Windows PowerShell can be made to use BOM-less UTF-8 consistently - see this answer.


To complement M. Dudley's own simple and pragmatic answer (and ForNeVeR's more concise reformulation):

  • A simple, (non-streaming) PowerShell-native alternative is to use New-Item, which (curiously) creates BOM-less UTF-8 files by default even in Windows PowerShell:

    # Note the use of -Raw to read the file as a whole.
    # Unlike with Set-Content / Out-File *no* trailing newline is appended.
    $null = New-Item -Force $MyPath -Value (Get-Content -Raw $MyPath)
    
    • Note: To save the output from arbitrary commands in the same format as Out-File would, pipe to Out-String first; e.g.:

       $null = New-Item -Force Out.txt -Value (Get-ChildItem | Out-String) 
      
  • For convenience, below is advanced custom function Out-FileUtf8NoBom, a pipeline-based alternative that mimics Out-File, which means:

    • you can use it just like Out-File in a pipeline.
    • input objects that aren't strings are formatted as they would be if you sent them to the console, just like with Out-File.
    • an additional -UseLF switch allows you to use Unix-format LF-only newlines ("`n") instead of the Windows-format CRLF newlines ("`r`n") you normally get.

Example:

(Get-Content $MyPath) | Out-FileUtf8NoBom $MyPath # Add -UseLF for Unix newlines

Note how (Get-Content $MyPath) is enclosed in (...), which ensures that the entire file is opened, read in full, and closed before sending the result through the pipeline. This is necessary in order to be able to write back to the same file (update it in place).
Generally, though, this technique is not advisable for 2 reasons: (a) the whole file must fit into memory and (b) if the command is interrupted, data will be lost.
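The data-loss risk in (b) can be reduced by writing to a temporary file first and replacing the original only after the write has finished. A rough sketch of that idea follows; the function name Update-FileAsUtf8NoBom and the ".tmp" suffix are illustrative assumptions, and the memory caveat in (a) still applies:

```powershell
# Hypothetical helper: rewrites an existing text file as BOM-less UTF-8 via a temporary file.
function Update-FileAsUtf8NoBom {
    param([Parameter(Mandatory)] [string] $LiteralPath)

    # Resolve to a full path, because .NET's working directory can differ from PowerShell's.
    $fullPath = Convert-Path -LiteralPath $LiteralPath
    $tempPath = $fullPath + '.tmp'    # illustrative temporary file name

    # @(...) guarantees an array even if the file is empty.
    $lines = [string[]] @(Get-Content -LiteralPath $fullPath)

    # WriteAllLines without an encoding argument writes UTF-8 without a BOM.
    [System.IO.File]::WriteAllLines($tempPath, $lines)

    # Replace the original only once the new file has been written completely.
    Move-Item -LiteralPath $tempPath -Destination $fullPath -Force
}
```

If the command is interrupted during the write, only the temporary file is incomplete and the original is still intact.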

A note on memory use:

  • M. Dudley's own answer and the New-Item alternative above require that the entire file contents be built up in memory first, which can be problematic with large input sets.
  • The function below does not require this, because it is implemented as a proxy (wrapper) function (for a concise summary of how to define such functions, see this answer).

Source code of function Out-FileUtf8NoBom:

Note: The function is also available as an MIT-licensed Gist, and only the latter will be maintained going forward.

You can install it directly with the following command (while I can personally assure you that doing so is safe, you should always check the content of a script before directly executing it this way):

# Download and define the function.
irm https://gist.github.com/mklement0/8689b9b5123a9ba11df7214f82a673be/raw/Out-FileUtf8NoBom.ps1 | iex

function Out-FileUtf8NoBom {

  <#
  .SYNOPSIS
    Outputs to a UTF-8-encoded file *without a BOM* (byte-order mark).

  .DESCRIPTION

    Mimics the most important aspects of Out-File:
      * Input objects are sent to Out-String first.
      * -Append allows you to append to an existing file, -NoClobber prevents
        overwriting of an existing file.
      * -Width allows you to specify the line width for the text representations
        of input objects that aren't strings.
    However, it is not a complete implementation of all Out-File parameters:
      * Only a literal output path is supported, and only as a parameter.
      * -Force is not supported.
      * Conversely, an extra -UseLF switch is supported for using LF-only newlines.

  .NOTES
    The raison d'être for this advanced function is that Windows PowerShell
    lacks the ability to write UTF-8 files without a BOM: using -Encoding UTF8 
    invariably prepends a BOM.

    Copyright (c) 2017, 2022 Michael Klement <[email protected]> (http://same2u.net), 
    released under the [MIT license](https://spdx.org/licenses/MIT#licenseText).

  #>

  [CmdletBinding(PositionalBinding=$false)]
  param(
    [Parameter(Mandatory, Position = 0)] [string] $LiteralPath,
    [switch] $Append,
    [switch] $NoClobber,
    [AllowNull()] [int] $Width,
    [switch] $UseLF,
    [Parameter(ValueFromPipeline)] $InputObject
  )

  begin {

    # Convert the input path to a full one, since .NET's working dir. usually
    # differs from PowerShell's.
    $dir = Split-Path -LiteralPath $LiteralPath
    if ($dir) { $dir = Convert-Path -ErrorAction Stop -LiteralPath $dir } else { $dir = $pwd.ProviderPath }
    $LiteralPath = [IO.Path]::Combine($dir, [IO.Path]::GetFileName($LiteralPath))
    
    # If -NoClobber was specified, throw an exception if the target file already
    # exists.
    if ($NoClobber -and (Test-Path -LiteralPath $LiteralPath)) {
      Throw [IO.IOException] "The file '$LiteralPath' already exists."
    }
    
    # Create a StreamWriter object.
    # Note that we take advantage of the fact that the StreamWriter class by default:
    # - uses UTF-8 encoding
    # - without a BOM.
    $sw = New-Object System.IO.StreamWriter $LiteralPath, $Append
    
    $htOutStringArgs = @{}
    if ($Width) { $htOutStringArgs += @{ Width = $Width } }

    try { 
      # Create the script block with the command to use in the steppable pipeline.
      $scriptCmd = { 
        & Microsoft.PowerShell.Utility\Out-String -Stream @htOutStringArgs | 
          . { process { if ($UseLF) { $sw.Write(($_ + "`n")) } else { $sw.WriteLine($_) } } }
      }  
      
      $steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
      $steppablePipeline.Begin($PSCmdlet)
    }
    catch { throw }

  }

  process
  {
    $steppablePipeline.Process($_)
  }

  end {
    $steppablePipeline.End()
    $sw.Dispose()
  }

}

Upvotes: 67

FooFoo

Reputation: 21

I have created the following code for easy logging. It creates a UTF-8 file without BOM when executed in PowerShell 5. It works as expected for me. Feel free to customize it to your needs :-)

Function myWriteLog {
    # $LogFilePath has to be defined before calling the function.
    # "$ScriptName = $MyInvocation.MyCommand.Name" also has to be set before calling the function.
    Param( [Parameter(Mandatory=$true, ValueFromPipeline=$true)]
           [string]$content
         )

    # Disallow a null, empty, or whitespace-only value.
    if ([string]::IsNullOrWhiteSpace($content)) {
        throw "Found 'EMPTY or NULL': you must provide a nonNull and nonEmpty string to function ""myWriteLog"""
    }

    if (-not (Test-Path $LogFilePath)) {

        # Create the file; please note that the option "-NoNewline" has to be set.
        "" | Out-File -FilePath $LogFilePath -Encoding ascii -Force -NoNewline

        # Create a string as a line separator for the file header.
        $t = "".PadLeft(("Logfile for : $ScriptName").Length, "#")
        Add-Content -Path $LogFilePath -Value "$t"
        Add-Content -Path $LogFilePath -Value "Logfile for : $ScriptName"
        Add-Content -Path $LogFilePath -Value "LogFile Created: $(Get-Date -F "yyyy-MM-dd-HH-mm-ss")"
        Add-Content -Path $LogFilePath -Value "$t"
        Add-Content -Path $LogFilePath -Value ""
    }

    # Now add the content to the log file (new or existing).
    Add-Content -Path $LogFilePath -Value "$(Get-Date -F "yyyy-MM-dd-HH-mm-ss") : $content" -Encoding UTF8 -Force
}

Upvotes: 2

Lucero

Reputation: 60190

Here's an alternative to the accepted answer.

The advantage of this approach is that it's compatible with IO.FileInfo objects (from functions like Get-Item) and relative paths.

  1. Create a Text.UTF8Encoding object

    • While Text.UTF8Encoding is capable of inserting a BOM, it doesn't by default
  2. Call the object's GetBytes method to convert a string into bytes

    • Ensure the target string isn't actually a string array; $stringVar.Count should equal 1
  3. Write the byte array to your target with Set-Content -Encoding Byte

# This is a reusable class instance object
$utf8 = New-Object Text.UTF8Encoding

$GCRaw = Get-Content -Raw -PSPath $MyPath
Set-Content -Value $utf8.GetBytes($GCRaw) -Encoding Byte -PSPath $MyPath

This can be shortened by letting -Value be inferred by position and, additionally, by creating the Text.UTF8Encoding object from within the argument.

$GCRaw = Get-Content $MyPath -Raw

Set-Content ([Text.UTF8Encoding]::new().GetBytes($GCRaw)) -Encoding Byte -PSPath $MyPath

#NOTE#
# ((New-Object Text.UTF8Encoding).GetBytes($GCRaw))
# can be used instead of
# ([Text.UTF8Encoding]::new().GetBytes($GCRaw))
# For code intended to be compact, I recommend the latter,
# not just because it's not as long, but also because its
# lack of whitespace makes it visually more distinct.

Upvotes: 21

LPChip

Reputation: 884

If your first line contains only plain ASCII (nothing fancy that requires UTF8), the following will create a UTF8 file without BOM on stock Windows 10 PowerShell:

$file = get-content -path "C:\temp\myfile.txt" -Encoding UTF8

# do some stuff.

$file[0] | out-file "C:\temp\mynewfile.txt" -Encoding ascii
$file | select -skip 1 | out-file "C:\temp\mynewfile.txt" -append -Encoding utf8

This uses 2 lines to create the new file. The first one uses -encoding ascii so that the file starts without a BOM; ASCII output is byte-for-byte valid UTF8, but it is limited to 7-bit ASCII. With a text file, this is usually not an issue; otherwise you'd probably choose byte encoding anyway.

The second command appends the rest with full UTF8 support, skipping the first line because it has already been written.

Upvotes: 1

Pravanjan Hota

Reputation: 21

I would say to use just the Set-Content command, nothing else needed.

The PowerShell version on my system is:

PS C:\Users\XXXXX> $PSVersionTable.PSVersion | fl


Major         : 5
Minor         : 1
Build         : 19041
Revision      : 1682
MajorRevision : 0
MinorRevision : 1682

PS C:\Users\XXXXX>

So you would need something like the following.

PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt | Set-Content .\Downloads\anotherfile.txt
PS C:\Users\XXXXX> Get-Content .\Downloads\anotherfile.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX>

Now when we check the file, as per the screenshot, it is UTF-8. (Screenshot: anotherfile.txt)

PS: To answer the comment about the foreign character issue: the contents of the file "testfgnchar.txt", which contains foreign characters, were copied to "findfnchar2.txt" using the following command.

PS C:\Users\XXXXX> Get-Content .\testfgnchar.txt | Set-Content findfnchar2.txt
PS C:\Users\XXXXX>

screen-shot is here.

Note: Newer versions of PowerShell now exist than the one I used when writing this answer.

Upvotes: 2

Tanmay Sarin

Reputation: 21

I used this method to edit a UTF8-NoBOM file, and it generated a file with the correct encoding:

$fileD = "file.xml"
(Get-Content $fileD) | ForEach-Object { $_ -replace 'replace text',"new text" } | out-file "file.xml" -encoding ASCII

I was skeptical of this method at first, but it surprised me and worked!

Tested with powershell version 5.1

Upvotes: 1

Erik Anderson

Reputation: 5289

One technique I utilize is to redirect output to an ASCII file using the Out-File cmdlet.

For example, I often run SQL scripts that create another SQL script to execute in Oracle. With simple redirection (">"), the output will be in UTF-16 which is not recognized by SQLPlus. To work around this:

sqlplus -s / as sysdba "@create_sql_script.sql" |
Out-File -FilePath new_script.sql -Encoding ASCII -Force

The generated script can then be executed via another SQLPlus session without any Unicode worries:

sqlplus / as sysdba "@new_script.sql" |
tee new_script.log

Update: As others have pointed out, this will drop non-ASCII characters. Since the user asked for a way to "force" conversion, I assume they do not care about that, as perhaps their data does not contain such characters.

If you care about the preservation of non-ASCII characters, this is not the answer for you.
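To make the loss concrete, here is a small sketch of what ASCII encoding does to characters outside the 7-bit range. The sample string is an assumption chosen for illustration, and it is built from character codes so that the snippet does not depend on how the script file itself is encoded:

```powershell
# Sample text "Grüße", built from code points (0xFC = ü, 0xDF = ß).
$text = "Gr" + [char]0xFC + [char]0xDF + "e"

# ASCII encoding substitutes '?' (byte 0x3F) for every character it cannot represent.
$asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($text)
$roundTrip  = [System.Text.Encoding]::ASCII.GetString($asciiBytes)

$roundTrip    # Gr??e
```

The original characters cannot be recovered from the written file afterwards.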

Upvotes: 1

Nader Gharibian Fard

Reputation: 8085

I had the same error in PowerShell and fixed it with this solution:

$PSDefaultParameterValues['*:Encoding'] = 'utf8'
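Note that, according to other answers here, -Encoding utf8 still writes a BOM in Windows PowerShell 5.1, so this setting by itself may not remove it there. As a hedged variation, assuming PowerShell 6 or later (where utf8NoBOM is an accepted value), the default can be limited to the cmdlets that write files instead of using the * wildcard; the choice of cmdlets below is illustrative:

```powershell
# Assumption: PowerShell 6 or later, where 'utf8NoBOM' is an accepted -Encoding value.
if ($PSVersionTable.PSVersion.Major -ge 6) {
    # Set the default only for specific file-writing cmdlets rather than for '*'.
    $PSDefaultParameterValues['Out-File:Encoding']    = 'utf8NoBOM'
    $PSDefaultParameterValues['Set-Content:Encoding'] = 'utf8NoBOM'
    $PSDefaultParameterValues['Add-Content:Encoding'] = 'utf8NoBOM'
}
```

The setting lasts for the current session only, unless it is placed in a profile script.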

Upvotes: 0

JensG

Reputation: 13401

Old question, new answer:

While the "old" powershell writes a BOM, the new platform-agnostic variant does behave differently: The default is "no BOM" and it can be configured via switch:

-Encoding

Specifies the type of encoding for the target file. The default value is utf8NoBOM.

The acceptable values for this parameter are as follows:

  • ascii: Uses the encoding for the ASCII (7-bit) character set.
  • bigendianunicode: Encodes in UTF-16 format using the big-endian byte order.
  • oem: Uses the default encoding for MS-DOS and console programs.
  • unicode: Encodes in UTF-16 format using the little-endian byte order.
  • utf7: Encodes in UTF-7 format.
  • utf8: Encodes in UTF-8 format.
  • utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
  • utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
  • utf32: Encodes in UTF-32 format.

Source: https://learn.microsoft.com/de-de/powershell/module/Microsoft.PowerShell.Utility/Out-File?view=powershell-7 Emphasis mine

Upvotes: 8

Andreas Covidiot

Reputation: 4745

important!: this only works if an extra space or newline at the start is no problem for your use case of the file
(e.g. if it is an SQL file, Java file or human readable text file)

one could use a combination of creating an empty (non-UTF8 or ASCII (UTF8-compatible)) file and appending to it (replace $str with gc $src if the source is a file):

" "    |  out-file  -encoding ASCII  -noNewline  $dest
$str  |  out-file  -encoding UTF8   -append     $dest

as one-liner

replace $dest and $str according to your use case:

$_ofdst = $dest ; " " | out-file -encoding ASCII -noNewline $_ofdst ; $str | out-file -encoding UTF8 -append $_ofdst

as simple function

function Out-File-UTF8-noBOM { param( $str, $dest )
  " "    |  out-file  -encoding ASCII  -noNewline  $dest
  $str  |  out-file  -encoding UTF8   -append     $dest
}

using it with a source file:

Out-File-UTF8-noBOM  (gc $src)  $dest

using it with a string:

Out-File-UTF8-noBOM  $str  $dest
  • optionally: continue appending with Out-File:

    "more foo bar"  |  Out-File -encoding UTF8 -append  $dest
    

Upvotes: 8

Zombo

Reputation: 1

For PowerShell 5.1, enable this setting:

Control Panel, Region, Administrative, Change system locale, Use Unicode UTF-8 for worldwide language support

Then enter this into PowerShell:

$PSDefaultParameterValues['*:Encoding'] = 'Default'

Alternatively, you can upgrade to PowerShell 6 or higher.

https://github.com/PowerShell/PowerShell

Upvotes: 4

sourcenouveau

Reputation: 30504

Using .NET's UTF8Encoding class and passing $False to the constructor seems to work:

$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)

Upvotes: 314

sc911

Reputation: 1290

Starting from version 6 powershell supports the UTF8NoBOM encoding both for set-content and out-file and even uses this as default encoding.

So in the above example it should simply be like this:

$MyFile | Out-File -Encoding UTF8NoBOM $MyPath

Upvotes: 36

SATO Yusuke

Reputation: 2184

If you want to use [System.IO.File]::WriteAllLines(), you should cast second parameter to String[] (if the type of $MyFile is Object[]), and also specify absolute path with $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), like:

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Set-Variable MyFile
[System.IO.File]::WriteAllLines($ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), [String[]]$MyFile, $Utf8NoBomEncoding)

If you want to use [System.IO.File]::WriteAllText(), sometimes you should pipe the second parameter into | Out-String | to add CRLFs to the end of each line explicitly (especially when you use them with ConvertTo-Csv):

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | Set-Variable tmp
[System.IO.File]::WriteAllText("/absolute/path/to/foobar.csv", $tmp, $Utf8NoBomEncoding)

Or you can use [Text.Encoding]::UTF8.GetBytes() with Set-Content -Encoding Byte:

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | % { [Text.Encoding]::UTF8.GetBytes($_) } | Set-Content -Encoding Byte -Path "/absolute/path/to/foobar.csv"

see: How to write result of ConvertTo-Csv to a file in UTF-8 without BOM

Upvotes: 2

frank tan

Reputation: 141

    [System.IO.FileInfo] $file = Get-Item -Path $FilePath 
    $sequenceBOM = New-Object System.Byte[] 3 
    $reader = $file.OpenRead() 
    $bytesRead = $reader.Read($sequenceBOM, 0, 3) 
    $reader.Dispose() 
    # A UTF-8+BOM string will start with the three following bytes. Hex: 0xEF 0xBB 0xBF, Decimal: 239 187 191 
    if ($bytesRead -eq 3 -and $sequenceBOM[0] -eq 239 -and $sequenceBOM[1] -eq 187 -and $sequenceBOM[2] -eq 191) 
    { 
        $utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False) 
        [System.IO.File]::WriteAllLines($FilePath, (Get-Content $FilePath), $utf8NoBomEncoding) 
        Write-Host "Remove UTF-8 BOM successfully" 
    } 
    Else 
    { 
        Write-Warning "Not UTF-8 BOM file" 
    }  

Source: How to remove UTF8 Byte Order Mark (BOM) from a file using PowerShell

Upvotes: 2

Lenny

Reputation: 5939

I figured this wouldn't be UTF, but I just found a pretty simple solution that seems to work...

Get-Content path/to/file.ext | out-file -encoding ASCII targetFile.ext

For me this results in a utf-8 without bom file regardless of the source format.
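One way to check what was actually written is to look at the first few bytes of the result. The helper below is a hypothetical sketch (Get-FileHead is not a built-in cmdlet); a UTF-8 BOM would show up as EF BB BF and a UTF-16 LE BOM as FF FE:

```powershell
# Hypothetical helper: returns the first bytes of a file as a hex string, e.g. to spot a BOM.
# It reads the whole file for simplicity, which is fine for a quick check on small files.
function Get-FileHead {
    param(
        [Parameter(Mandatory)] [string] $LiteralPath,
        [int] $Count = 3
    )
    $bytes = [System.IO.File]::ReadAllBytes((Convert-Path -LiteralPath $LiteralPath))
    $n = [Math]::Min($Count, $bytes.Length)
    if ($n -eq 0) { return '' }
    ($bytes[0..($n - 1)] | ForEach-Object { $_.ToString('X2') }) -join ' '
}
```

For example, Get-FileHead -LiteralPath targetFile.ext prints the first three bytes of the target file from the command above.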

Upvotes: 81

Jaume Suñer Mut

Reputation: 401

Change multiple files by extension to UTF-8 without BOM:

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in ls -recurse -filter "*.java") {
    $MyFile = Get-Content $i.fullname 
    [System.IO.File]::WriteAllLines($i.fullname, $MyFile, $Utf8NoBomEncoding)
}

Upvotes: 1

ForNeVeR

Reputation: 6956

The proper way as of now is to use a solution recommended by @Roman Kuzmin in comments to @M. Dudley answer:

[IO.File]::WriteAllLines($filename, $content)

(I've also shortened it a bit by stripping unnecessary System namespace clarification - it will be substituted automatically by default.)
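A small, self-contained sketch of this call, with illustrative values for $filename and $content. A full path is used because, as noted in other answers, .NET resolves relative paths against its own working directory rather than PowerShell's current location:

```powershell
# Illustrative values; $filename and $content stand in for your own.
$filename = Join-Path ([System.IO.Path]::GetTempPath()) 'utf8-nobom-demo.txt'
$content  = 'first line', 'second line'

# Without an encoding argument, WriteAllLines uses UTF-8 and emits no BOM.
[IO.File]::WriteAllLines($filename, [string[]] $content)

# Inspect the first byte: 0x66 is 'f', so the file starts with text, not with EF BB BF.
$first = [IO.File]::ReadAllBytes($filename)[0]
'0x{0:X2}' -f $first    # 0x66
```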

Upvotes: 119

Robin Wang

Reputation: 839

Could use below to get UTF8 without BOM

$MyFile | Out-File -Encoding ASCII $MyPath

Upvotes: -3

Krzysztof

Reputation: 21

This one works for me (use "Default" instead of "UTF8"):

$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "Default" $MyPath

The result is ASCII without BOM.

Upvotes: -4

jamhan

Reputation: 3120

This script will convert, to UTF-8 without BOM, all .txt files in DIRECTORY1 and output them to DIRECTORY2

foreach ($i in ls -name DIRECTORY1\*.txt)
{
    $file_content = Get-Content "DIRECTORY1\$i";
    [System.IO.File]::WriteAllLines("DIRECTORY2\$i", $file_content);
}

Upvotes: 6
