Unexpected encoding error using JSON.parse

Question

I've got a rather large JSON file on my Windows machine and it contains stuff like \xE9. When I JSON.parse it, it works fine.

However, when I push the code to my server running CentOS, I always get this: "\xE9" on US-ASCII (Encoding::InvalidByteSequenceError)

Here is the output of file on both machines:

Windows:

λ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators

CentOS:

$ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators

Here is the error I get when trying to parse it:

$ ruby -rjson -e 'JSON.parse(File.read("data.json"))'
/usr/local/rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)

What could be causing this problem? I've tried using iconv to change the file into every possible encoding I can, but nothing seems to work.

mu is too short · Accepted Answer

"\xE9" is é in ISO-8859-1 (and in various other ISO-8859-X encodings, Windows-1250, and so on). On its own it is certainly not UTF-8, where é is the two-byte sequence "\xC3\xA9".

You can get File.read to fix up the encoding for you by using the encoding options:

File.read('data.json',
  :external_encoding => 'iso-8859-1',
  :internal_encoding => 'utf-8'
)

That will give you a UTF-8 encoded string that you can hand to JSON.parse.
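As a rough, self-contained check of that behaviour, here is a minimal sketch. The helper name read_latin1_as_utf8 and the file name sample.json are made up for illustration; the sketch writes a tiny Latin-1 file into a temporary directory rather than touching the real data.json.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper (not from the answer): read a file stored as
# ISO-8859-1 and return its contents transcoded to UTF-8.
def read_latin1_as_utf8(path)
  File.read(path,
    :external_encoding => 'iso-8859-1',
    :internal_encoding => 'utf-8'
  )
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.json')   # made-up file name
  # Write raw Latin-1 bytes: 0xE9 is é in ISO-8859-1.
  File.binwrite(path, "{\"name\":\"caf\xE9\"}".b)

  text = read_latin1_as_utf8(path)
  puts text.encoding                             # UTF-8
  puts JSON.parse(text)['name'] == "caf\u00E9"   # true
end
```

The single 0xE9 byte on disk becomes the two bytes 0xC3 0xA9 in the returned string, which is what JSON.parse expects from UTF-8 input.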

Or you could let JSON.parse deal with the encoding by using just :external_encoding to make sure the string comes off the disk with the right encoding flag:

JSON.parse(
  File.read('data.json',
    :external_encoding => 'iso-8859-1',
  )
)

You should have a close look at data.json to figure out why file(1) thinks it is UTF-8. The file might incorrectly have a BOM when it is not UTF-8 or someone might be mixing UTF-8 and Latin-1 encoded strings in one file.
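One possible way to take that close look is sketched below. It assumes that reading the whole file as raw bytes is acceptable; the helper name inspect_encoding and the keys it returns are invented for this example.

```ruby
# Hypothetical helper: report a few basic facts about a file's raw bytes.
def inspect_encoding(path)
  bytes = File.binread(path)   # binary (ASCII-8BIT) string, no transcoding
  {
    # Does the file start with the UTF-8 byte order mark EF BB BF?
    :utf8_bom   => bytes.start_with?("\xEF\xBB\xBF".b),
    # Would the bytes be well-formed if interpreted as UTF-8?
    :valid_utf8 => bytes.dup.force_encoding('UTF-8').valid_encoding?,
    # Are all bytes below 0x80 (so the encoding question is moot)?
    :ascii_only => bytes.ascii_only?
  }
end

# Usage, assuming data.json is in the current directory:
#   p inspect_encoding('data.json')
```

If :valid_utf8 comes back true, the bytes are at least well-formed UTF-8. If it is false while file(1) still reports UTF-8, a stray BOM or a mix of UTF-8 and Latin-1 strings becomes a plausible explanation.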
