Ilia Ross

Reputation: 13412

Perl JSON encode in UTF-8 strange behaviour

According to the Perl JSON 2.90 documentation, to encode a JSON object in UTF-8, all you need to do is:

$json_text = JSON->new->utf8->encode($perl_scalar)

That seemed obvious, and this is what I did. After a while, I got an issue report on GitHub from one of the users, which really surprised me, as it shouldn't have been happening!

I spent hours trying to figure out what was going on, but the solution turned out to be very weird and, from my point of view, wrong.

What eventually worked for me is this:

$json_text = JSON->new->latin1->encode($perl_scalar)

After that, I tested this code with all kinds of characters, including Russian and Chinese, and it just worked.

Can anyone please explain why encoding works correctly with latin1 and not with utf8, when it should actually be vice versa?

Upvotes: 1

Views: 1358

Answers (2)

ikegami

Reputation: 385867

Two possible bugs could result in the described outcome.

  1. You were passing strings already encoded using UTF-8 to encode.

    If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.C3.A9, you are suffering from this bug.

    If you are suffering from this bug, properly decode all inputs to your program, and continue using JSON->new->utf8->encode (aka encode_json).

  2. You were encoding the output of the JSON command using UTF-8 a second time, possibly via a :utf8 or :encoding layer on a file handle.

    If $string contains installé and sprintf '%vX', $string returns 69.6E.73.74.61.6C.6C.E9, you are suffering from this bug.

    If you are suffering from this bug, either use JSON->new->encode (aka to_json) and keep the second layer of encoding, or use JSON->new->utf8->encode (aka encode_json) and remove the second layer of encoding.

In neither case is the solution to use JSON->new->latin1->encode.
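The sprintf check and the first bug can be reproduced with a minimal sketch like the one below. It assumes the core JSON::PP module behaves like JSON for ->utf8->encode, and the variable names and sample strings are made up for illustration. If your string prints like $decoded, the input side is fine and the problem is on the output side.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;   # core module; assumed to match JSON's utf8/encode interface

my $json = JSON::PP->new->utf8;

# Hypothetical inputs: the same word as decoded characters and as raw UTF-8 bytes.
my $decoded = "install\x{E9}";     # 8 characters, the last one is U+00E9
my $encoded = "install\xC3\xA9";   # 9 bytes, the UTF-8 encoding of the above

print sprintf('%vX', $decoded), "\n";   # 69.6E.73.74.61.6C.6C.E9
print sprintf('%vX', $encoded), "\n";   # 69.6E.73.74.61.6C.6C.C3.A9

my $good  = $json->encode([$decoded]);                     # é becomes C3 A9
my $bad   = $json->encode([$encoded]);                     # bug 1: é becomes C3 83 C2 A9
my $fixed = $json->encode([ decode('UTF-8', $encoded) ]);  # decode first, then encode

print sprintf('%vX', $good), "\n";
print sprintf('%vX', $bad), "\n";
print $fixed eq $good ? "fixed matches good\n" : "fixed differs from good\n";
```

Decoding at the point where the data enters the program (file, socket, database, form field) is what keeps the encoder from seeing bytes it mistakes for characters.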

Upvotes: 3

Leon Timmermans

Reputation: 30225

What are you doing to output $json_text? What kind of binmode do you use on that handle? The screenshot looks like it's double-encoded, which suggests the handle has :utf8 or :encoding enabled (which is incorrect for a handle you write already-encoded data to). As unintuitive as it may seem, ->latin1 giving a correct result matches that hypothesis (PerlIO assumes any binary string is encoded as latin-1).
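To make that hypothesis concrete, here is a rough sketch that writes to in-memory handles instead of real files. It assumes the core JSON::PP module mirrors JSON's utf8/latin1/encode interface; bytes_written is a hypothetical helper, and I'm guessing at the layer, since the question doesn't show how the handle was opened.

```perl
use strict;
use warnings;
use JSON::PP;   # core module; assumed to mirror JSON's utf8/latin1/encode interface

# Hypothetical helper: print $text to an in-memory handle opened with $layer
# and return the bytes that would have ended up in the file.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $buffer;
}

my $data     = ["install\x{E9}"];         # decoded characters
my $expected = "[\"install\xC3\xA9\"]";   # the correct UTF-8 bytes

my $utf8   = JSON::PP->new->utf8->encode($data);     # already UTF-8 bytes
my $latin1 = JSON::PP->new->latin1->encode($data);   # Latin-1 bytes
my $chars  = JSON::PP->new->encode($data);           # characters, not yet encoded

my @cases = (
    [ 'utf8   + :encoding(UTF-8)', bytes_written(':encoding(UTF-8)', $utf8)   ],
    [ 'latin1 + :encoding(UTF-8)', bytes_written(':encoding(UTF-8)', $latin1) ],
    [ 'utf8   + :raw',             bytes_written(':raw',             $utf8)   ],
    [ 'plain  + :encoding(UTF-8)', bytes_written(':encoding(UTF-8)', $chars)  ],
);
printf "%-27s %s\n", $_->[0], $_->[1] eq $expected ? 'correct' : 'double-encoded'
    for @cases;

# Characters above 255 are escaped by ->latin1, so the output stays pure ASCII.
my $cyrillic = JSON::PP->new->latin1->encode(["\x{416}"]);
print "$cyrillic\n";
```

Only the first combination comes out double-encoded. The ->latin1 variant looks right because the layer re-encodes each Latin-1 byte as UTF-8, and anything above 255 (Russian, Chinese) is emitted as a \uXXXX escape, which is plain ASCII. That would explain why it appeared to work, but the last two combinations are the ones to actually use.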

Upvotes: 3
