Dennis G

Reputation: 21788

Decode or unescape \u00f0\u009f\u0091\u008d to 👍

We all know UTF-8 is hard. I exported my messages from Facebook, and the resulting JSON file has all non-ASCII characters escaped as \u sequences.

I am looking for an easy way to unescape these sequences back to regular old UTF-8. I would also love to use PowerShell.

I tried

$str = "\u00f0\u009f\u0091\u008d"
[Regex]::Replace($str, "\\[Uu]([0-9A-Fa-f]{4})", `
{[char]::ToString([Convert]::ToInt32($args[0].Groups[1].Value, 16))} )

but that only gives me ð as a result, not 👍.

I also tried using Notepad++ and I found this SO post: How to convert escaped Unicode (e.g. \u0432\u0441\u0435) to UTF-8 chars (все) in Notepad++. The accepted answer there produces exactly the same result as the example above: ð.

I did find a working decoder: the UTF8.js library decodes the text perfectly, and you can try it out here (with \u00f0\u009f\u0091\u008d as input).

Is there a way in PowerShell to decode \u00f0\u009f\u0091\u008d to receive 👍? I'd love to have real UTF-8 in my exported Facebook messages so I can actually read them.

Bonus points for helping me understand what \u00f0\u009f\u0091\u008d actually represents (besides it being some UTF-8 hex representation). Why is it the same as U+1F44D or \uD83D\uDC4D in C++?

Upvotes: 10

Views: 19134

Answers (3)

Garric

Reputation: 734

What's pleasing about mklement0's example is that it is easy to produce an encoded string of this type.

What's bad is that the resulting line will be huge (the first two nibbles, '00', of every escape are wasted).

I must admit, mklement0's example is charming.

The code for encoding is one line only:

$emoji = 'A 👍 for Motörhead.'
Add-Type -AssemblyName System.Web   # LoadWithPartialName is deprecated
$str = (([System.Web.HttpUtility]::UrlEncode($emoji)) -replace '%', '\u00') -replace '\+', ' '
$str

You can decode this the standard URL way:

$str="A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."
$str=$str -replace '\\u00','%'
[Reflection.Assembly]::LoadWithPartialName("System.Web") | Out-Null
[System.Web.HttpUtility]::UrlDecode($str)

A 👍 for Motörhead.

Upvotes: 1

Garric

Reputation: 734

ISO-8859-1 is very often the intermediate encoding in operations on mangled UTF-8, because it maps every byte 0x00–0xFF to the Unicode code point of the same value.

$text = [regex]::Unescape("A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead.")
Write-Host "[regex]::Unescape(utf-8) = $text"
$encTo = [System.Text.Encoding]::GetEncoding('iso-8859-1')  # maps chars U+0000..U+00FF back to single bytes 1:1
$bytes = $encTo.GetBytes($text)
$text = [System.Text.Encoding]::UTF8.GetString($bytes)
Write-Host "utf8_DecodedFrom_8859_1 = $text"

[regex]::Unescape(utf-8) = A ð for MotÃ¶rhead.

utf8_DecodedFrom_8859_1 = A 👍 for Motörhead.

Upvotes: 2

mklement0

Reputation: 439597

The Unicode code point of the 👍 character is U+1F44D.

Using the variable-length UTF-8 encoding, the following 4 bytes (expressed as hex numbers) are needed to represent this code point: F0 9F 91 8D.
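This mapping from code point to UTF-8 bytes can be verified directly in PowerShell; a minimal sketch using .NET's built-in encoder:

```powershell
# Verify that U+1F44D encodes to the UTF-8 bytes F0 9F 91 8D
$thumbsUp = [char]::ConvertFromUtf32(0x1F44D)                 # the 👍 string
$bytes    = [System.Text.Encoding]::UTF8.GetBytes($thumbsUp)
($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '     # F0 9F 91 8D
```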

While these bytes are recognizable in your string,

$str = "\u00f0\u009f\u0091\u008d"

they shouldn't be represented as \u escape codes, because they're not Unicode code units / code points; they're bytes.

With 4-hex-digit escape sequences, which represent UTF-16 code units, the proper representation requires 2 16-bit code units, a so-called surrogate pair, which together represent the single non-BMP code point U+1F44D:

$str = "\uD83D\uDC4D"

If your JSON input used such proper Unicode escapes, PowerShell would process the string correctly; e.g.:

'{ "str": "\uD83D\uDC4D" }' | ConvertFrom-Json > out.txt

If you examine file out.txt, you'll see something like:

str
---
👍 

(The output was sent to a file, because console windows wouldn't render the 👍 char. correctly, at least not without additional configuration; note that if you used PowerShell Core on Linux or macOS, however, terminal output would work.)


Therefore, the best solution would be to correct the problem at the source and use proper Unicode escapes (or even use the characters themselves, as long as the source supports any of the standard Unicode encodings).

If you really must parse the broken representation, try the following workaround (PSv4+), building on your own [regex]::Replace() technique:

$str = "A \u00f0\u009f\u0091\u008d for Mot\u00c3\u00b6rhead."

[regex]::replace($str, '(?:\\u[0-9a-f]{4})+', { param($m) 
  $utf8Bytes = (-split ($m.Value -replace '\\u([0-9a-f]{4})', '0x$1 ')).ForEach([byte])
  [text.encoding]::utf8.GetString($utf8Bytes)
})

This should yield A 👍 for Motörhead.

The above translates sequences of \u... escapes into the byte values they represent and interprets the resulting byte array as UTF-8 text.
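To repair an entire export in one go, the same technique can be applied to the whole file. A sketch, assuming the export is named messages.json (the file names are hypothetical); the regex is narrowed to \u00xx sequences so that any legitimate surrogate escapes (e.g. \ud83d) are left intact for the JSON parser:

```powershell
# Decode all mis-escaped UTF-8 byte sequences in the exported file
$raw = Get-Content -Raw -Encoding UTF8 .\messages.json
$decoded = [regex]::Replace($raw, '(?:\\u00[0-9a-f]{2})+', { param($m)
  # Turn "\u00f0\u009f..." into the bytes 0xF0 0x9F ... and decode them as UTF-8
  $utf8Bytes = (-split ($m.Value -replace '\\u00([0-9a-f]{2})', '0x$1 ')).ForEach([byte])
  [System.Text.Encoding]::UTF8.GetString($utf8Bytes)
})
$decoded | Set-Content -Encoding UTF8 .\messages.decoded.json
```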


To save the decoded string to a UTF-8 file, use ... | Set-Content -Encoding utf8 out.txt

Alternatively, in PSv5+, as Dennis himself suggests, you can make Out-File, and therefore its effective alias >, default to UTF-8 via PowerShell's global parameter-defaults hashtable:

$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'

Note, however, that on Windows PowerShell (as opposed to PowerShell Core) you'll get a UTF-8 file with a BOM in both cases; avoiding that requires direct use of the .NET Framework: see Using PowerShell to write a file in UTF-8 without the BOM
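For completeness, the BOM-less .NET approach might look like this (a sketch; the file name and variable are placeholders):

```powershell
# Write $decoded to out.txt as UTF-8 *without* a BOM (works in Windows PowerShell too)
$decoded   = 'A 👍 for Motörhead.'
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)   # $false = do not emit a BOM
# Use an absolute path: .NET's working dir may differ from PowerShell's
[System.IO.File]::WriteAllText((Join-Path $PWD 'out.txt'), $decoded, $utf8NoBom)
```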

Upvotes: 9
