How to read an ANSI text file and convert strings to UTF-8 in Ruby 1.9?

Question

I am trying to read text files with Ruby 1.9 and convert them into my own XML structure. I don't have control over the source text files, so they could be in any encoding.

Here is what I do at the moment:

lines = File.readlines(input_file)
lines.each do |line|
  #do something
end

I have a problem with a file that contains the é character (\xE9). When I try to process the corresponding line, I get an "Invalid byte sequence in UTF-8" exception when I call .match(...) on the string.

I tried to use the workaround described at Fixing invalid UTF-8 in Ruby, revisited:

require 'iconv'

lines = File.readlines(input_file)
# Converting UTF-8 to UTF-8 with //IGNORE drops any invalid byte sequences
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
lines.each do |line|
  unless line.empty?
    valid_string = ic.iconv(line + ' ')[0..-2]
    #do something
  end
end

but this simply strips the é character from the line, which is not what I want.

I think the real problem is that the file is not actually UTF-8 but uses some ANSI encoding. Even so, calling .encoding on each line reports UTF-8. My guess is that I need to read the file differently so that it works for both ANSI and UTF-8 files, but I am a Ruby beginner and don't know where to start.

the Tin Man · Accepted Answer

The é character (byte \xE9) is part of the ISO-8859-1 and Win-1252 character sets, among others. The second is probably the most popular character set for Windows, and is your most likely source.

RUBY_VERSION # => "1.9.2"

That's the Ruby version I used for the following tests. Note that in the samples below the # encoding lines aren't ordinary comments; they're directives telling Ruby which character set to assume for the source, including string literals that contain raw high-bit bytes:

# encoding: Windows-1252

RUBY_VERSION # => "1.9.2"

asdf = "\xe9"
asdf.encoding # => #<Encoding:Windows-1252>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>

This shows the character in ISO-8859-1:

# encoding: ISO-8859-1

RUBY_VERSION # => "1.9.2"

asdf = "\xe9"
asdf.encoding # => #<Encoding:ISO-8859-1>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>
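The same conversion can be applied while reading a file rather than to a string literal. The following is a minimal sketch, assuming the input really is Windows-1252; the helper name read_lines_as_utf8 and the temporary demo file are hypothetical and only there for illustration. It relies on the "r:EXTERNAL:INTERNAL" mode string that File.open accepts in Ruby 1.9 and later.

```ruby
require 'tempfile'

# Hypothetical helper: read a file whose bytes are assumed to be in
# source_encoding and return its lines transcoded to UTF-8.
def read_lines_as_utf8(path, source_encoding = 'Windows-1252')
  # "r:EXTERNAL:INTERNAL" declares the encoding of the bytes on disk and the
  # encoding Ruby should transcode to while reading.
  File.open(path, "r:#{source_encoding}:UTF-8") { |f| f.readlines }
end

# Demo: write "caf" followed by the single byte 0xE9 to a temporary file.
tmp = Tempfile.new('ansi')
tmp.binmode
tmp.write([0x63, 0x61, 0x66, 0xE9, 0x0A].pack('C*'))
tmp.close

lines = read_lines_as_utf8(tmp.path)
puts lines.first.encoding           # prints UTF-8
puts lines.first.valid_encoding?    # prints true
puts 'matched' if lines.first =~ /caf\u00E9/
tmp.unlink
```

Because the lines are valid UTF-8 after reading, calls such as .match no longer raise the invalid byte sequence error. If the file is actually in a different encoding, the text will be converted without error but the characters may be wrong.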

James Gray did a series of articles a couple years ago about dealing with this stuff. It's good reading.

Now, back to figuring out which character set a character could be in: when you only have one character, it is difficult to determine the set, because the same byte is valid in several sets at once. If you have more characters >= "\x80", you can run through the character sets iconv supports and try converting them. That's messy, but I had to do that in Perl for some screen scraping about five years ago. An alternative is to use the Python chardet code.
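The "try converting" idea can be sketched with Ruby 1.9's built-in String methods instead of iconv. This is only an illustration under stated assumptions: the helper name to_utf8 and the fallback list are hypothetical, and this is an ordered guess rather than real detection, since almost any byte sequence decodes as Windows-1252 and every byte sequence decodes as ISO-8859-1.

```ruby
# Hypothetical helper: return a UTF-8 copy of raw bytes. Valid UTF-8 is kept
# as-is; otherwise each fallback encoding is tried in order.
def to_utf8(bytes, fallbacks = ['Windows-1252', 'ISO-8859-1'])
  utf8 = bytes.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  fallbacks.each do |name|
    begin
      # Relabel the bytes as the candidate encoding, then transcode.
      return bytes.dup.force_encoding(name).encode('UTF-8')
    rescue EncodingError
      next # a byte is undefined in this candidate; try the next one
    end
  end
  raise ArgumentError, 'no candidate encoding matched'
end

raw = [0x63, 0x61, 0x66, 0xE9].pack('C*') # "caf" plus the byte 0xE9
converted = to_utf8(raw)
puts converted.encoding         # prints UTF-8
puts converted.valid_encoding?  # prints true
```

Checking UTF-8 first matters: multi-byte UTF-8 sequences also decode without error as Windows-1252, so reversing the order would silently turn valid UTF-8 files into mojibake.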

James Gray's articles have a link to an article recommending rchardet.

The above routines mention Mozilla's Charset Detectors, which will give you more info on dealing with this.
