OliverLx

Reputation: 33

Convert a string in PowerShell (in Europe) to UTF-8

For a REST call I need the German "Stück" in UTF-8 as read from an access database with

$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source=$filename;Persist Security Info=False;")

and try to convert it. I have found out that PowerShell ISE seems to encode string constants in ANSI. So I tried as a minimum test without database and got the same result:

$Text1 = "Stück" # entered via ISE, this is also what I get from the database
# ($StringFromDatabase -eq $Text1) shows $true

$enc = [System.Text.Encoding]::GetEncoding(1252).GetBytes($Text1)
# also tried [System.Text.Encoding]::GetEncoding("ISO-8859-1") # = 28591

$Text1 = [System.Text.Encoding]::UTF8.GetString($enc)

$Text1
$Text1 = "Stück" # = UTF-8, entered here with Notepad++, encoding set to UTF-8
"must see: $Text1"

So I get two outputs - the converted one (showing "St?ck") but I need to see "Stück".

Upvotes: 3

Views: 9268

Answers (1)

mklement0

Reputation: 437353

"that PowerShell ISE seems to encode string constants in ANSI."

That only applies when communicating with external programs, whereas you're using in-process .NET APIs.

As an aside: this discrepancy with regular console windows, which use the active OEM code page, is one of the reasons that make the obsolescent ISE problematic - see the bottom section of this answer for more information.

String literals in memory are always .NET strings, which are UTF-16-encoded (composed of 16-bit Unicode code units), capable of representing all Unicode characters.[1]
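To make this concrete, here is a minimal sketch (using only standard .NET APIs): an in-memory string needs no "conversion" to become Unicode - it already is Unicode; an encoding such as UTF-8 only comes into play at the moment you turn the string into bytes, and decoding those bytes with the *same* encoding round-trips the string losslessly:

```powershell
$Text1 = "Stück"   # a .NET string; already Unicode (UTF-16) in memory

# Encode to UTF-8 *bytes* only at the point where bytes are needed:
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($Text1)
$utf8Bytes.Count   # 6 bytes: 'ü' occupies two bytes in UTF-8

# Decoding with the *same* encoding recovers the original string:
[System.Text.Encoding]::UTF8.GetString($utf8Bytes)   # -> Stück
```

Note how this differs from the code in the question, which decoded Windows-1252 bytes *as if* they were UTF-8 - a mismatch that produces the "St?ck" output.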


Character encoding in web-service calls (Invoke-RestMethod, Invoke-WebRequest):

To send UTF-8 strings, specify charset=utf-8 as part of the -ContentType argument; e.g.:

Invoke-RestMethod -ContentType 'text/plain; charset=utf-8' ...

On receiving strings, PowerShell automatically decodes them based either on an explicitly specified charset field (character encoding) in the response's content header or, in its absence, on ISO-8859-1 (which is closely related to, but in effect a subset of, Windows-1252).

  • If a given response doesn't specify a charset but actually uses a different encoding from ISO-8859-1 - say UTF-8 - PowerShell will misinterpret the strings received, which requires re-encoding after the fact - see this answer.
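A sketch of that after-the-fact re-encoding, assuming the server actually sent UTF-8 without declaring a charset (the first statement merely simulates the resulting misinterpretation):

```powershell
# Simulate a response that was UTF-8 on the wire but was decoded as ISO-8859-1:
$mojibake = [System.Text.Encoding]::GetEncoding(28591).GetString(
  [System.Text.Encoding]::UTF8.GetBytes('Stück')
)
$mojibake   # -> StÃ¼ck (mojibake)

# Re-encode: recover the raw bytes via ISO-8859-1, then decode them as UTF-8:
$fixed = [System.Text.Encoding]::UTF8.GetString(
  [System.Text.Encoding]::GetEncoding(28591).GetBytes($mojibake)
)
$fixed   # -> Stück
```

This works because ISO-8859-1 maps each byte 1:1 to a character, so encoding the misinterpreted string back to ISO-8859-1 bytes recovers the original UTF-8 byte stream unchanged.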

Character encoding when communicating with external programs:

If you need to send a string with a particular encoding to an external program (via the pipeline, which the target program receives via stdin), set the $OutputEncoding preference variable to that encoding, and PowerShell will automatically convert your .NET strings to the specified encoding.

To send UTF-8-encoded strings to external programs via the pipeline:

$OutputEncoding = [System.Text.UTF8Encoding]::new()

Note, however, that this alone isn't sufficient for correctly receiving UTF-8 output from external programs; for that, you must also set [Console]::OutputEncoding to the same encoding.

To make your PowerShell session fully UTF-8-aware (irrespective of whether in the ISE or a regular console window):

# Needed in the ISE only:
chcp >$null # Dummy console-program call that ensures that a console is allocated.

# Set all encodings relevant to communicating with external programs to UTF-8.
$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding =
  [System.Text.UTF8Encoding]::new()

See this answer for more information.


[1] Note, however, that Unicode characters with a code point greater than 0xFFFF, i.e. those outside the so-called BMP (Basic Multilingual Plane), must be represented with two 16-bit code units ([char]), namely so-called surrogate pairs.
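A quick illustration of such a surrogate pair, using a character outside the BMP:

```powershell
# U+1F600 (an emoji) lies outside the BMP, so it needs two [char] code units:
$emoji = [char]::ConvertFromUtf32(0x1F600)
$emoji.Length                        # 2 - a surrogate pair, not a single [char]
[char]::IsHighSurrogate($emoji[0])   # True
[char]::IsLowSurrogate($emoji[1])    # True
```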

Upvotes: 3
