ImperatorMing

Reputation: 57

Encoding with PowerShell

I have the following problem: imagine I have a UTF-8 file in which every special character has been turned into the REPLACEMENT CHARACTER "�". Part of the file could look like:

Das hier r�ckg�ngig ist das zu machen r�ckg�ngig : ist bereits geamcht Weitere W�rter gibt ers zu korrigieren Hier noch ein bl�des Wort zwei in einer Zeile G�hte und Gr��e

I wrote a PowerShell script which replaces the REPLACEMENT CHARACTERS with the corresponding special characters, for example "ä", "ü" or "ß". The corrected text, also UTF-8, then looks like:

Das hier rückgängig ist das zu machen rückgängig : ist bereits geamcht Weitere Wörter gibt ers zu korrigieren Hier noch ein blödes Wort zwei in einer Zeile Göhte und Größe

The problem is that the program I want to import the text into only accepts "Western European DOS (CP850)" encoded files. By the way, that was the original encoding in which the program exported the file, and it would have imported the file without problems if I hadn't opened it, edited it and saved it as UTF-8. So here is what happened:

  1. I exported files from a specific program as "Western European DOS (CP850)". [Note: every special character has its own replacement character here, so an import would work easily and restore the special characters]

  2. I opened the file with an editor of my choice, and the editor detected "UTF-8" on its own, which is not correct. I did not notice this, edited the file and saved it as UTF-8. [Now every special character has the same REPLACEMENT CHARACTER, i.e. �]

  3. I realized that something was wrong and wrote a script which replaces every occurrence of � with the right special character in UTF-8. [I think it doesn't matter how the script does this, but if so, ask]

  4. I now have the corrected UTF-8 file, but as you remember, I have to import it into my program as "Western European DOS (CP850)", the same encoding in which it exported the file. This encoding ensures that every special character has its own unique replacement character. So how do I get back to this with PowerShell?

Here is some more information. The line in which the script reads the file I want to correct is:

$lines = Get-Content $file -Encoding utf8 | Select-String $SearchCharacter
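
A side note on this line: Select-String returns MatchInfo objects rather than plain strings, so the text of each matching line has to be taken from the object's Line property, roughly like this (sketch):

$lines | ForEach-Object { $_.Line }   # the raw text of each matching line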

The algorithm runs through every line, asks for a correction for every wrong word containing the character, and skips a word if it is found again. After all corrections from all files have been collected, it replaces, in a loop, the occurrences of every "key" (wrong word) with its "value" (corrected word) in each file, along these lines (the collected pairs are shown here as a hashtable, $corrections):

foreach ($key in $corrections.Keys) {   # $corrections: wrong word -> corrected word
    $value = $corrections[$key]
    (Get-Content -Encoding utf8 $file) -replace "$key", "$value" | Set-Content -Encoding utf8 $file
}
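
One caveat that is independent of the encoding question: the left operand of -replace is interpreted as a regular expression, so if a wrong word ever contained regex metacharacters, it would need to be escaped, for example:

(Get-Content -Encoding utf8 $file) -replace [regex]::Escape($key), $value | Set-Content -Encoding utf8 $file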

I already tried something like this:

foreach ($key in $corrections.Keys) {
    $value = $corrections[$key]
    (Get-Content -Encoding utf8 $file) -replace "$key", "$value" | Set-Content -Encoding OEM $file
}

But this results in "?" being used instead of the correct characters:

Das hier r?ckg?ngig ist das zu machen r?ckg?ngig : ist bereits geamcht Weitere W?rter gibt ers zu korrigieren Hier noch ein bl?des Wort zwei in einer Zeile G?hte und Gr??e

Any suggestions on how I can build a "Western European DOS (CP850)" file from UTF-8?

EDIT:

This function, derived from http://www.msdynamics.de/viewtopic.php?f=17&t=25726#p138532, solved my problem:

Function ConvertAndReplace_UTF8_OEM850
{
    Param ([String]$path)

    $path = Resolve-Path $path
    # Read the file as UTF-8 (code page 65001) and rewrite it in place as OEM 850.
    $sourceEncoding = [System.Text.Encoding]::GetEncoding(65001)
    $targetEncoding = [System.Text.Encoding]::GetEncoding(850)
    $textfile = [System.IO.File]::ReadAllText($path, $sourceEncoding)
    [System.IO.File]::WriteAllText($path, $textfile, $targetEncoding)
    Write-Host "Content in $path converted from UTF-8 to OEM850"
}
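
It can then be called on the corrected file, for example like this (the file name is just a placeholder):

ConvertAndReplace_UTF8_OEM850 -path .\export.txt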

Upvotes: 2

Views: 5306

Answers (1)

mklement0

Reputation: 440501

Given that you say that you've fixed the UTF-8-encoded file (so that it contains the original characters), all you need is to transcode the UTF-8 file back to code page 850 (CP850):

If your system's active OEM code page is 850 (verify with chcp):

Set-Content -NoNewline -Encoding OEM $file -Value (Get-Content -Raw -Encoding utf8 $file)

Note: (Get-Content -encoding utf8 $file) | Set-Content -Encoding OEM $file works too, but potentially alters the newline sequences used, and always appends a trailing newline, even if the original file didn't have one. However, this variant may still be the better choice in Windows PowerShell v4 and below, where -NoNewline isn't supported.

If it is not, or cannot be assumed to be:

In PowerShell [Core] 6+, Set-Content's -Encoding parameter now accepts code-page numbers:

Set-Content -NoNewline -Encoding 850 $file -Value (Get-Content -Raw -Encoding utf8 $file)

In Windows PowerShell (PowerShell versions up to v5.1), direct use of the .NET Framework is needed:

[IO.File]::WriteAllText(
  (Convert-Path $file),
  (Get-Content -Raw -Encoding utf8 $file),
  [Text.Encoding]::GetEncoding(850)
)

Note the use of Convert-Path to ensure that $file is resolved to a full path, which is necessary, because .NET's working directory usually differs from PowerShell's.
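
To illustrate the difference (paths are hypothetical):

Set-Location C:\data
[Environment]::CurrentDirectory    # .NET's current directory; may still point elsewhere
Convert-Path .\export.txt          # resolves relative to PowerShell's current location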


In Windows PowerShell, the -Encoding parameter only accepts a fixed set of values; as far as legacy code pages are concerned, it is limited to the active ANSI (Default) and OEM (OEM) code page, which are determined by your system's legacy system locale (language for non-Unicode programs).
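
If you want to see which code pages those placeholders refer to on a given machine, something along these lines works in Windows PowerShell (a sketch; chcp shows the console's code page as well):

[System.Text.Encoding]::Default.CodePage   # active ANSI code page, e.g. 1252
[Console]::OutputEncoding.CodePage         # console's (OEM) code page, e.g. 850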

In PowerShell [Core] 6+, you can pass any code page by number or even a System.Text.Encoding instance directly.
Conversely, even though OEM can still be used to refer to the active OEM code page, as of v7.0 there is no placeholder for the active ANSI code page - this omission has been reported in this GitHub issue.
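
For example, the PowerShell (Core) command shown above could equivalently pass an encoding instance (sketch):

Set-Content -NoNewline -Encoding ([System.Text.Encoding]::GetEncoding(850)) $file -Value (Get-Content -Raw -Encoding utf8 $file)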

Upvotes: 3
