pguardiario

Reputation: 54984

ruby 1.9 wrong file encoding on windows

I have a ruby file with these contents:

# encoding: iso-8859-1
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
puts File.read('foo.txt').encoding

Can someone explain what's happening here?

UPDATE

Here's a better description of what I'm looking for:

Upvotes: 3

Views: 2381

Answers (1)

Darshan Rivka Whittle

Reputation: 34031

You're not specifying the encoding when you read the file. You're being very careful to specify it everywhere except there, but then you're reading it with the default encoding.

File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'.force_encoding('iso-8859-1')}
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding }

# => ISO-8859-1

Also note that you probably mean 'fòo'.encode('iso-8859-1') rather than 'fòo'.force_encoding('iso-8859-1'). The latter leaves the bytes unchanged, while the former transcodes the string.
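To make that difference concrete, here is a minimal sketch that compares the bytes after each call. The helper names relabel and transcode are made up for this illustration; only force_encoding and encode are real String methods.

```ruby
# Relabel: keep the bytes, change only the Encoding attached to them.
def relabel(str, encoding_name)
  str.dup.force_encoding(encoding_name)
end

# Transcode: convert the characters to new bytes in the target encoding.
def transcode(str, encoding_name)
  str.encode(encoding_name)
end

# "f\u00F2o" is f, o-with-grave, o. The \u escape makes this a UTF-8 string
# whatever the source file's own encoding is.
utf8 = "f\u00F2o"

p utf8.bytes.to_a                           # [102, 195, 178, 111]
p relabel(utf8, 'iso-8859-1').bytes.to_a    # [102, 195, 178, 111]  (unchanged)
p transcode(utf8, 'iso-8859-1').bytes.to_a  # [102, 242, 111]       (converted)
```

The relabeled string still holds the two UTF-8 bytes for the accented letter, so Ruby would treat it as two separate ISO-8859-1 characters.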

Update: I'll elaborate a bit since I wasn't as clear or thorough as I could have been.

  1. If you don't specify an encoding with File.read(), the file will be read with Encoding.default_external. Since you're not setting that yourself, Ruby picks a value based on the environment it's run in. In your Windows environment, it's IBM437; in your Cygwin environment, it's UTF-8. So my point above was that the encoding you see is simply default_external; it has nothing to do with what bytes are contained in the file. Ruby doesn't auto-detect encodings for you.

  2. force_encoding() doesn't change the bytes in a string; it only changes the Encoding attached to those bytes. If you tell Ruby "pretend this string is ISO-8859-1", then it won't transcode the string when you tell it "please write this string as ISO-8859-1". encode() transcodes for you, as does writing to the file if you don't trick it into not doing so.
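Point 1 can be checked with a small sketch along these lines. It assumes Encoding.default_internal is unset (the usual case), and the helper name encoding_of_plain_read is made up for this example.

```ruby
require 'tmpdir'

# Write "f\u00F2o" as ISO-8859-1, then read the file back without naming
# an encoding, and return the Encoding that Ruby attached to the result.
def encoding_of_plain_read(path)
  File.open(path, 'w:iso-8859-1') { |f| f << "f\u00F2o" }
  File.read(path).encoding
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'foo.txt')

  puts Encoding.default_external     # depends on your environment
  puts encoding_of_plain_read(path)  # same value as the line above
  p File.binread(path).bytes.to_a    # [102, 242, 111], the ISO-8859-1 bytes
end
```

The bytes on disk are ISO-8859-1 either way; only the label on the string you read back follows default_external.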

Putting those together, if you have a source file in ISO-8859-1:

# encoding: iso-8859-1

# Write in ISO-8859-1 regardless of default_external
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}

# Read in ISO-8859-1 regardless of default_external,
#  transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1

puts File.read('foo.txt').encoding # -> Whatever is specified by default_external

If you have a source file in UTF-8:

# encoding: utf-8

# Write in ISO-8859-1 regardless of default_external, transcoding from UTF-8
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}

# Read in ISO-8859-1 regardless of default_external,
#  transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1

puts File.read('foo.txt').encoding # -> Whatever is specified by default_external

Update 2, to answer your new questions:

  1. No, the # encoding: iso-8859-1 line does not change Encoding.default_external; it only tells Ruby that the source file itself is encoded in ISO-8859-1. Simply add

    Encoding.default_external = "iso-8859-1"
    

    if you expect all files that you read to be stored in that encoding.

  2. No, I don't personally think Ruby should auto-detect encodings, but reasonable people can disagree on that one, and a discussion of "should it be so" seems off-topic here.

  3. Personally, I use UTF-8 for everything, and in the rare circumstances that I can't control encoding, I manually set the encoding when I read the file, as demonstrated above. My source files are always in UTF-8. If you're dealing with files that you can't control and don't know the encoding of, the charguess gem or similar would be useful.
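As a sketch of the "set the encoding when I read the file" approach from point 3: the mode string "r:iso-8859-1:utf-8" names the external encoding first and the internal encoding second, so Ruby transcodes to UTF-8 as it reads. The helper name read_latin1_as_utf8 is invented for this example, and it assumes you already know the file is ISO-8859-1.

```ruby
require 'tmpdir'

# Read a file known to be ISO-8859-1 and return its text as UTF-8.
# Mode "r:iso-8859-1:utf-8" = external encoding ISO-8859-1, internal UTF-8.
def read_latin1_as_utf8(path)
  File.open(path, 'r:iso-8859-1:utf-8') { |f| f.read }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'foo.txt')

  # Write the raw ISO-8859-1 bytes for f, o-with-grave, o.
  File.open(path, 'wb') { |f| f.write([102, 242, 111].pack('C*')) }

  text = read_latin1_as_utf8(path)
  puts text.encoding  # UTF-8
  p text.bytes.to_a   # [102, 195, 178, 111]
end
```

This keeps the rest of the program in UTF-8 without touching Encoding.default_external.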

Upvotes: 7
