Dan Dascalescu

Reputation: 151926

Convert emoji Unicode byte sequences to Unicode characters with jq

I'm filtering Facebook Messenger JSON dumps with jq. The source JSON contains emojis as \u-escaped sequences of their UTF-8 bytes. How can I output these as the actual emojis?

echo '{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}' | jq -c '.'

Actual result:

{"content":"ð¤·ð¿ââï¸"}

Desired result:

{"content":"🤷🏿‍♂️"}

Upvotes: 0

Views: 988

Answers (4)

peak

Reputation: 116690

Here's a jq-only solution. It works with both the C and Go implementations of jq.

# input: a decimal integer
# output: the corresponding binary array, most significant bit first
def binary_digits:
  if . == 0 then [0]   # return an array here too, as the comment above promises
  else [recurse( if . == 0 then empty else ./2 | floor end ) % 2]
    | reverse
    | .[1:] # remove the leading 0
  end ;

def binary_to_decimal:
  reduce reverse[] as $b ({power:1, result:0};
       .result += .power * $b
       | .power *= 2)
  | .result;

# input: an array of decimal integers representing the utf-8 bytes of a Unicode codepoint.
# output: the corresponding decimal number of that codepoint.
def utf8_decode:
   # Magic numbers:
   # x80: 128,       # 10000000
   # xe0: 224,       # 11100000
   # xf0: 240        # 11110000
     (-6) as $mb     # non-first bytes start 10 and carry 6 bits of data
                     # first byte of a 2-byte encoding starts 110 and carries 5 bits of data
                     # first byte of a 3-byte encoding starts 1110 and carries 4 bits of data
                     # first byte of a 4-byte encoding starts 11110 and carries 3 bits of data
   | map(binary_digits) as $d
   | .[0]
   | if   . < 128 then $d[0]
     elif . < 224 then [$d[0][-5:][], $d[1][$mb:][]]
     elif . < 240 then [$d[0][-4:][], $d[1][$mb:][], $d[2][$mb:][]]
     else              [$d[0][-3:][], $d[1][$mb:][], $d[2][$mb:][], $d[3][$mb:][]]
     end
   | binary_to_decimal ;
{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}
| .content |= (explode | [utf8_decode] | implode)

Transcript:

$ jq -nM -f program.jq
{
  "content": "🤷"
}
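
The transcript above shows only the first emoji character, which suggests the filter as written decodes a single UTF-8 sequence rather than the whole string. As a cross-check of the bit layout described in the comments, here is a minimal Python sketch (not jq, and not the answer's own code) that walks the entire byte list. The name utf8_decode_all is hypothetical, and the sketch assumes well-formed input: it does no validation of continuation bytes.

```python
def utf8_decode_all(byte_values):
    """Decode a list of ints (UTF-8 bytes) into a list of code points.

    Assumes well-formed UTF-8; malformed input is not detected.
    """
    code_points = []
    i = 0
    while i < len(byte_values):
        first = byte_values[i]
        if first < 0x80:      # 0xxxxxxx: single byte, 7 data bits
            length, value = 1, first
        elif first < 0xE0:    # 110xxxxx: 2-byte sequence, 5 data bits in the lead byte
            length, value = 2, first & 0x1F
        elif first < 0xF0:    # 1110xxxx: 3-byte sequence, 4 data bits in the lead byte
            length, value = 3, first & 0x0F
        else:                 # 11110xxx: 4-byte sequence, 3 data bits in the lead byte
            length, value = 4, first & 0x07
        # Each continuation byte (10xxxxxx) contributes its low 6 bits.
        for b in byte_values[i + 1 : i + length]:
            value = (value << 6) | (b & 0x3F)
        code_points.append(value)
        i += length
    return code_points


# The string from the question: each character stands for one UTF-8 byte.
mojibake = (
    "\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf"
    "\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"
)
decoded = utf8_decode_all([ord(c) for c in mojibake])
fixed = "".join(chr(cp) for cp in decoded)
```

The loop advances by the length of each sequence, which is the step that turns a one-code-point decoder into a whole-string decoder.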

Upvotes: 1

Rob Napier

Reputation: 299275

@chepner's use of Latin1 in Python finally shook free in my head how to do this with jq almost directly. You'll need to pipe through iconv:

$ echo '{"content":"\u00f0\u..."}' | jq -c . | iconv -t latin1
{"content":"🤷🏿‍♂️"}

In JSON, the string \u00f0 does not mean "the byte 0xF0, as part of a UTF-8 encoded sequence." It means "Unicode code point 0x00F0." That's ð, and jq is displaying it correctly as the UTF-8 encoding 0xc3 0xb0.

The iconv call converts the UTF-8 encoding of ð (0xc3 0xb0) into Latin1, where it becomes the single byte 0xf0 (Latin1 exactly matches the first 256 Unicode code points, U+0000 through U+00FF). Your UTF-8 capable terminal then interprets that as the first byte of a UTF-8 sequence.
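
To make the byte-level reinterpretation concrete, here is a small Python sketch of the same round trip. It only illustrates what the pipeline does to the bytes and does not replace the jq | iconv command; the variable names are made up for the example.

```python
# What JSON's \u00f0 denotes: the code point U+00F0 (the letter "eth").
c = "\u00f0"

# jq prints that code point as UTF-8, which takes two bytes.
as_utf8 = c.encode("utf-8")        # b'\xc3\xb0'

# Converting to Latin-1 (what `iconv -t latin1` does) yields one byte
# whose value equals the code point number.
as_latin1 = c.encode("latin-1")    # b'\xf0'

# Applied to the whole string from the question, the Latin-1 bytes form
# valid UTF-8, which is what the terminal then decodes.
mojibake = (
    "\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf"
    "\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"
)
raw_bytes = mojibake.encode("latin-1")
emoji = raw_bytes.decode("utf-8")
```

Note that this only works while every character in the output is at or below U+00FF; a string that mixes in real characters above that range cannot be converted to Latin-1 without loss.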

Upvotes: 3

chepner

Reputation: 530950

The problem is that the response contains the UTF-8 encoding of the Unicode code points, not the code points themselves. jq cannot decode this itself. You could use another language; for example, in Python:

>>> import json
>>> x = json.load(open("response.json"))['content']
>>> x
'ð\x9f¤·ð\x9f\x8f¿â\x80\x8dâ\x99\x82ï¸\x8f'
>>> x.encode('latin1').decode()
'🤷🏿\u200d♂️'

It's not exact, but I'm not sure the encoding is unambiguous. For example,

>>> x.encode('latin1')
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿‍♂️'.encode()
b'\xf0\x9f\xa4\xb7\xf0\x9f\x8f\xbf\xe2\x80\x8d\xe2\x99\x82\xef\xb8\x8f'
>>> '🤷🏿‍♂️'.encode().decode()
'🤷🏿\u200d♂️'

The result of re-encoding the response as Latin-1 is identical to encoding the desired emoji as UTF-8, but decoding doesn't give back precisely the same emoji (or at least, Python isn't rendering it identically).
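
For a whole JSON document rather than a single field, the same Latin-1 trick can be applied recursively. The following is a minimal sketch under the assumption that every affected string consists solely of characters up to U+00FF; fix_mojibake is a hypothetical helper name, and strings that do not survive the round trip are left untouched.

```python
import json


def fix_mojibake(value):
    """Recursively re-decode strings whose characters are really UTF-8 bytes."""
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not mojibake of this kind (or already correct): keep as-is.
            return value
    if isinstance(value, list):
        return [fix_mojibake(v) for v in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None


raw = r'{"content":"\u00f0\u009f\u00a4\u00b7\u00f0\u009f\u008f\u00bf\u00e2\u0080\u008d\u00e2\u0099\u0082\u00ef\u00b8\u008f"}'
fixed = fix_mojibake(json.loads(raw))

# The desired emoji, spelled out code point by code point.
expected = "\U0001F937\U0001F3FF\u200d\u2642\ufe0f"
same = fixed["content"] == expected

# ensure_ascii=False keeps the emoji as characters instead of \u escapes.
output = json.dumps(fixed, ensure_ascii=False)
```

If same comes out true, the \u200d in the REPL output would only be how repr() displays the zero-width joiner, not evidence of a different string.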

Upvotes: 1

Michael-O

Reputation: 18430

First of all, you need a font which supports this.

You are confusing Unicode composed chars with UTF-8 encoding. It has to be either:

$ echo '{"content":"\ud83e\udd37\u200d\u2642"}' | jq -c '.'

or

$ echo '{"content":"\ud83e\udd37\u200d\u2642\ufe0f"}' | jq -c '.'
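
A JSON \u escape takes exactly four hex digits, so a code point above U+FFFF such as U+1F937 has to be written as a UTF-16 surrogate pair. A quick way to check this, sketched here with Python's json module rather than jq:

```python
import json

# Five hex digits after \u are read as a 4-digit escape plus a literal digit:
# \u1F93 (a Greek letter) followed by the character "7".
five_digits = json.loads(r'"\u1F937"')

# The surrogate pair form decodes to the single code point U+1F937.
surrogate_pair = json.loads(r'"\ud83e\udd37"')

# json.dumps escapes non-ASCII by default, which shows the pair
# that corresponds to a given emoji.
escaped = json.dumps("\U0001F937")
```

Here five_digits ends up as two characters, while surrogate_pair is the one intended emoji.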

Upvotes: -1
