Reputation: 151926
I'm filtering Facebook Messenger JSON dumps with jq. The source JSON contains the emojis as Unicode escape sequences. How can I output these back as emojis?
echo '{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}' | jq -c '.'
Actual result:
{"content":"ð¤·ð¿ââï¸"}
Desired result:
{"content":"🤷🏿♂️"}
Upvotes: 0
Views: 988
Reputation: 116690
Here's a jq-only solution. It works with both the C and Go implementations of jq.
# input: a decimal integer
# output: the corresponding binary array, most significant bit first
def binary_digits:
  if . == 0 then 0
  else [recurse( if . == 0 then empty else ./2 | floor end ) % 2]
    | reverse
    | .[1:]          # remove the leading 0
  end ;

def binary_to_decimal:
  reduce reverse[] as $b ({power: 1, result: 0};
      .result += .power * $b
    | .power *= 2)
  | .result;

# input: an array of decimal integers representing the UTF-8 bytes of a Unicode codepoint
# output: the corresponding decimal number of that codepoint
def utf8_decode:
  # Magic numbers:
  #   0x80: 128   # 10000000
  #   0xe0: 224   # 11100000
  #   0xf0: 240   # 11110000
  (-6) as $mb    # non-first bytes start 10 and carry 6 bits of data
                 # first byte of a 2-byte encoding starts 110 and carries 5 bits of data
                 # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
                 # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
  | map(binary_digits) as $d
  | .[0]
  | if   . < 128 then $d[0]
    elif . < 224 then [$d[0][-5:][], $d[1][$mb:][]]
    elif . < 240 then [$d[0][-4:][], $d[1][$mb:][], $d[2][$mb:][]]
    else               [$d[0][-3:][], $d[1][$mb:][], $d[2][$mb:][], $d[3][$mb:][]]
    end
  | binary_to_decimal ;

{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}
| .content |= (explode | [utf8_decode] | implode)
Transcript:
$ jq -nM -f program.jq
{
"content": "🤷"
}
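Note that utf8_decode consumes the whole byte array as a single code point, which is why only the first emoji (U+1F937) appears in the transcript. For comparison, here is a minimal Python sketch of the same byte-pattern logic (not part of the original answer, assuming Python 3); it likewise decodes one code point:

def utf8_decode(utf8_bytes):
    # Mirror of the jq filter: the first byte determines the sequence length
    # and how many data bits it carries; continuation bytes carry 6 bits each.
    b0 = utf8_bytes[0]
    if b0 < 0x80:                     # 0xxxxxxx: single byte, 7 data bits
        return b0
    elif b0 < 0xE0:                   # 110xxxxx: 2-byte sequence, 5 data bits
        cp, rest = b0 & 0x1F, 1
    elif b0 < 0xF0:                   # 1110xxxx: 3-byte sequence, 4 data bits
        cp, rest = b0 & 0x0F, 2
    else:                             # 11110xxx: 4-byte sequence, 3 data bits
        cp, rest = b0 & 0x07, 3
    for b in utf8_bytes[1:1 + rest]:  # continuation bytes: 10xxxxxx
        cp = (cp << 6) | (b & 0x3F)
    return cp

print(hex(utf8_decode([0xF0, 0x9F, 0xA4, 0xB7])))   # 0x1f937, i.e. 🤷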
Upvotes: 1
Reputation: 299275
@chepner's use of Latin-1 in Python finally shook free in my head how to do this with jq almost directly. You'll need to pipe through iconv:
$ echo '{"content":"\u00f0\u..."}' | jq -c . | iconv -t latin1
{"content":"🤷🏿♂️"}
In JSON, the string \u00f0 does not mean "the byte 0xF0, as part of a UTF-8 encoded sequence." It means "Unicode code point U+00F0." That's ð, and jq is displaying it correctly as its UTF-8 encoding 0xc3 0xb0.
The iconv call reinterprets the UTF-8 string for ð (0xc3 0xb0) back into Latin-1 as 0xf0 (Latin-1 exactly matches the first 256 Unicode code points). Your UTF-8 capable terminal then interprets that as the first byte of a UTF-8 sequence.
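The same round trip can be reproduced in Python (a quick illustration, not part of the original answer):

>>> s = '\u00f0\u009f\u00a4\u00b7'          # what jq parsed: U+00F0 U+009F U+00A4 U+00B7
>>> s.encode('utf-8')                       # what jq writes to the terminal
b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb7'
>>> s.encode('latin-1')                     # what iconv -t latin1 recovers
b'\xf0\x9f\xa4\xb7'
>>> s.encode('latin-1').decode('utf-8')     # which a UTF-8 terminal renders as the emoji
'🤷'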
Upvotes: 3
Reputation: 530950
The problem is that the response contains the UTF-8 encoding of the Unicode code points, not the code points themselves. jq cannot decode this itself. You could use another language; for example, in Python:
>>> import json
>>> x = json.load(open("response.json"))['content']
>>> x
'ð\x9f¤·ð\x9f\x8f¿â\x80\x8dâ\x99\x82ï¸\x8f'
>>> x.encode('latin1').decode()
'🤷🏿\u200d♂️'
It's not exact, but I'm not sure the encoding is unambiguous. For example,
>>> x.encode('latin1')
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿♂️'.encode()
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿♂️'.encode().decode()
'🤷🏿\u200d♂️'
The result of re-encoding the response using Latin-1 is identical to encoding the desired emoji as UTF-8, but decoding doesn't give back precisely the same emoji (or at least, Python isn't rendering it identically).
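For what it's worth, the decoded string appears to be code-point-for-code-point identical to the desired ZWJ emoji sequence; the visible difference is only that Python's repr escapes the zero-width joiner (U+200D), since it is a non-printable format character (a quick check, not part of the original answer):

>>> decoded = x.encode('latin1').decode()
>>> [hex(ord(c)) for c in decoded]
['0x1f937', '0x1f3ff', '0x200d', '0x2642', '0xfe0f']
>>> decoded == '\U0001F937\U0001F3FF\u200d\u2642\ufe0f'
True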
Upvotes: 1
Reputation: 18430
First of all, you need a font which supports this.
You are confusing Unicode composed characters with UTF-8 encoding. Since a JSON \u escape takes exactly four hex digits, code points above U+FFFF must be written as surrogate pairs, so it has to be either:
$ echo '{"content":"\ud83e\udd37\u200D\u2642"}' | jq -c '.'
or
$ echo '{"content":"\ud83e\udd37\u200D\u2642\uFE0F"}' | jq -c '.'
Upvotes: -1