Reputation: 2832
I am trying to convert all source files in a target folder to the UTF-8 (without BOM) encoding. I use the following PowerShell script:
$MyPath = "D:\my projects\etc\"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $content = Get-Content $_.FullName
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
    [System.IO.File]::WriteAllLines($_.FullName, $content, $Utf8NoBomEncoding)
}
cmd /c pause | out-null
It works fine if the files are not already UTF-8, but if a file is already BOM-less UTF-8, all the national characters get converted to unknown symbols (for example, if I run the script a second time). How can the script be changed to fix the problem?
Upvotes: 4
Views: 12282
Reputation: 437100
As Ansgar Wiechers points out in a comment, the problem is that Windows PowerShell, in the absence of a BOM, defaults to interpreting files as "ANSI"-encoded, i.e., the encoding implied by the legacy system locale (ANSI code page), as reflected by the .NET Framework (but not .NET Core) in [System.Text.Encoding]::Default.
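To see which encoding that is on a given machine, you can inspect that property directly; a minimal illustration (Windows PowerShell only; the Windows-1251 result is an example that assumes a Russian legacy system locale):
# Windows PowerShell (.NET Framework) only: reports the encoding of the
# active ANSI code page; e.g., WebName 'windows-1251', CodePage 1251.
[System.Text.Encoding]::Default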
Given that, based on your follow-up comments, the BOM-less files among your input files are a mix of Windows-1251-encoded and UTF-8 files, you must examine their content to determine their specific encoding:
- Read each file with -Encoding Utf8 and test if the resulting string contains the Unicode REPLACEMENT CHARACTER (U+FFFD). If it does, the implication is that the file is not UTF-8, because this special character is used to signal that byte sequences were encountered that aren't valid in UTF-8 (see the short demonstration after this list).
- If the file isn't valid UTF-8, simply read the file again without specifying -Encoding, which causes Windows PowerShell to interpret the file as Windows-1251-encoded, given that that is the encoding (code page) implied by your system locale.
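For instance, here is a quick demonstration of the replacement-character behavior (the byte 0xF1 is 'с' in Windows-1251, but is not valid as a standalone byte in UTF-8):
# Decoding an invalid-as-UTF-8 byte yields U+FFFD.
[Text.Encoding]::UTF8.GetString([byte[]] 0xF1) -eq [char] 0xFFFD  # -> True
The full script, then, looks like this: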
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to first try to read the file as UTF-8.
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName

    # If the replacement char. is found in the content, the implication
    # is that the file is NOT UTF-8, so read it again *without -Encoding*,
    # which interprets the file as "ANSI"-encoded (Windows-1251, in your case).
    if ($content.Contains([char] 0xfffd)) {
        $content = Get-Content -Raw $_.FullName
    }

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)
}
A faster alternative is to use [IO.File]::ReadAllText() with a UTF-8 encoding object that throws an exception when invalid-as-UTF-8 bytes are encountered (PSv5+ syntax):
$utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)
# ...
try {
    $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
} catch [Text.DecoderFallbackException] {
    $content = [IO.File]::ReadAllText($_.FullName, [Text.Encoding]::Default)
}
# ...
Adapting the above solutions to PowerShell Core / .NET Core:
- PowerShell Core defaults to (BOM-less) UTF-8, so simply omitting -Encoding doesn't work for reading ANSI-encoded files.
- Similarly, [System.Text.Encoding]::Default invariably reports UTF-8 in .NET Core.
Therefore, you must manually determine the active system locale's ANSI code page and obtain the corresponding encoding object:
$ansiEncoding = [Text.Encoding]::GetEncoding(
    [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
)
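For instance, on a machine whose legacy system locale is Russian, the lookup above would yield the Windows-1251 encoding (illustrative values; the actual result depends on the machine):
$ansiEncoding.WebName   # e.g. 'windows-1251'
$ansiEncoding.CodePage  # e.g. 1251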
You then need to pass this encoding explicitly to Get-Content -Encoding (Get-Content -Raw -Encoding $ansiEncoding $_.FullName) or to the .NET methods ([IO.File]::ReadAllText($_.FullName, $ansiEncoding)).
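Putting the PowerShell Core pieces together, a sketch of the first solution adapted to .NET Core / PowerShell Core (assuming, as before, a mix of UTF-8 and ANSI input files):
# PowerShell Core sketch: fall back to the system's ANSI encoding
# when a file turns out not to be valid UTF-8.
$utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)
$ansiEncoding = [Text.Encoding]::GetEncoding(
    [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
)
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    try {
        $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
    } catch [Text.DecoderFallbackException] {
        $content = [IO.File]::ReadAllText($_.FullName, $ansiEncoding)
    }
    [IO.File]::WriteAllText($_.FullName, $content)  # rewrites as BOM-less UTF-8
}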
Therefore, if some of your UTF-8-encoded files are already BOM-less, you must explicitly instruct Get-Content to treat them as UTF-8, using -Encoding Utf8 - otherwise they will be misinterpreted if they contain characters outside the 7-bit ASCII range:
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to ensure the correct interpretation of the input file
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)
}
Note: BOM-less UTF-8 files do not need rewriting in your scenario, but doing so is benign and simplifies the code; the alternative would be to test whether the first 3 bytes of each file are the UTF-8 BOM and skip the rewriting if they aren't, i.e., rewrite only files that still have a BOM:
$hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191'
(Windows PowerShell) or
$hasUtf8Bom = "$(Get-Content -AsByteStream -First 3 $_.FullName)" -eq '239 187 191'
(PowerShell Core).
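A minimal sketch of that alternative (Windows PowerShell syntax; it rewrites only the files that still have a BOM):
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191'
    if ($hasUtf8Bom) {  # only files with a BOM need rewriting
        $content = Get-Content -Raw -Encoding Utf8 $_.FullName
        [System.IO.File]::WriteAllText($_.FullName, $content)
    }
}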
As an aside: Should there be input files with a non-UTF-8 encoding (e.g., UTF-16), the solution still works as long as these files have a BOM, because PowerShell (quietly) gives precedence to a BOM over the encoding specified via -Encoding.
Note that using -Raw / WriteAllText() to read / write the files as a whole (single string) not only speeds up processing a little, but ensures that the following characteristics of each input file are preserved:
- the newline style (CRLF vs. LF)
- whether or not the file ends in a trailing newline
By contrast, not using -Raw (line-by-line reading) and using .WriteAllLines() does not preserve these characteristics: you invariably get the platform-appropriate newlines (in Windows PowerShell, always CRLF) and you always get a trailing newline.
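A quick, hypothetical demonstration of the difference (the temporary file name is made up for illustration):
$tmp = Join-Path $env:TEMP 'newline-demo.txt'
[IO.File]::WriteAllText($tmp, "line1`nline2")       # LF newline, no trailing newline: 11 chars.
[IO.File]::WriteAllLines($tmp, (Get-Content $tmp))  # line-by-line round trip
(Get-Content -Raw $tmp).Length                      # -> 14: now "line1`r`nline2`r`n" (CRLF + trailing newline)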
Note that the multi-platform PowerShell Core edition sensibly defaults to UTF-8 when reading a file without a BOM and also creates BOM-less UTF-8 files by default - creating a UTF-8 file with a BOM requires explicit opt-in with -Encoding utf8BOM.
Therefore, a PowerShell Core solution is much simpler:
# PowerShell Core only.
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # * Read the file at hand (UTF-8 files both with and without BOM are
    #   read correctly).
    # * Simply rewrite it with the *default* encoding, which in
    #   PowerShell Core is BOM-less UTF-8.
    # Note the (...) around the Get-Content call, which is necessary in order
    # to write back to the *same* file in the same pipeline.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    (Get-Content -Raw $_.FullName) | Set-Content -NoNewline $_.FullName
}
The above solutions work, but Get-Content and Set-Content are relatively slow, so using .NET types to both read and rewrite the files will perform better.
As above, no encoding needs to be specified explicitly in the following solution (not even in Windows PowerShell), because .NET itself has commendably defaulted to BOM-less UTF-8 since its inception (while still recognizing a UTF-8 BOM if present):
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText(
        $_.FullName,
        [System.IO.File]::ReadAllText($_.FullName)
    )
}
Upvotes: 7
Reputation: 30103
First, check for the presence of a BOM, e.g., using the following template (apply an action in place of the comment in each BOM branch):
$ps1scripts = Get-ChildItem .\*.ps1 -Recurse # change to match your circumstances
foreach ( $ps1script in $ps1scripts ) {
    $first3 = $ps1script | Get-Content -Encoding byte -TotalCount 3
    $first3Hex = '{0:X2}{1:X2}{2:X2}' -f $first3[0],$first3[1],$first3[2]
    $first2Hex = '{0:x2}{1:x2}' -f $first3[0],$first3[1]
    if ( $first3Hex -eq 'EFBBBF' ) {
        # UTF-8 BOM
    } elseif ( $first2Hex -eq 'fffe' ) {
        # UCS-2LE BOM
    } elseif ( $first2Hex -eq 'feff' ) {
        # UCS-2BE BOM
    } else {
        # unknown (no BOM)
    }
}
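For instance, a sketch of a possible action for the UTF-8 BOM branch, rewriting the file without its BOM (this assumes the goal of the question, i.e., BOM-less UTF-8 output):
$content = [IO.File]::ReadAllText($ps1script.FullName)    # a UTF-8 BOM is recognized and stripped on read
[IO.File]::WriteAllText($ps1script.FullName, $content)    # .NET writes BOM-less UTF-8 by default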
Note that the above template was derived from my older script; you can change the first line as follows:
$MyPath = "D:\my projects\etc\"
$ps1scripts = Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c
Upvotes: 0