Reputation: 14381
I am trying to read text files with Ruby 1.9 and convert them into my own XML structure. I don't have control over the source text files, so they could be in any encoding.
Here is what I do at the moment:
lines = File.readlines(input_file)
lines.each do |line|
  # do something
end
I have a problem with a file that contains the é character (\xE9). When I try to process the corresponding line I get an "Invalid byte sequence in UTF-8" exception when I call .match(...) on the string.
I tried to use the workaround described at "Fixing invalid UTF-8 in Ruby, revisited":
require 'iconv'

lines = File.readlines(input_file)
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
lines.each do |line|
  unless line.empty?
    # Append a space and strip it again so a trailing invalid byte is also dropped
    valid_string = ic.iconv(line + ' ')[0..-2]
    # do something
  end
end
but this simply strips the é character from the line, which is not what I want.
I think the real problem is that the file itself doesn't seem to be in UTF-8 but uses some ANSI encoding. Although the file is not UTF-8, the resulting line object reports UTF-8 when I call .encoding on it. My guess is that I need to read the file in a different way so that it works for both ANSI and UTF-8 files, but I am a Ruby beginner and I really don't know where to start.
Upvotes: 3
Views: 6225
Reputation: 160551
The é character (byte 0xE9) is part of the ISO-8859-1 and Windows-1252 character sets, among others. The second is probably the most popular character set on Windows, and is your most likely source.
RUBY_VERSION # => "1.9.2"
That's the Ruby version I used for the following tests. Note that in the following samples the # encoding lines aren't ordinary comments; they're directives telling Ruby which character set to assume for the string literals in the source file:
# encoding: Windows-1252
RUBY_VERSION # => "1.9.2"
asdf = "\xe9"
asdf.encoding # => #<Encoding:Windows-1252>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>
This shows the character in ISO-8859-1:
# encoding: ISO-8859-1
RUBY_VERSION # => "1.9.2"
asdf = "\xe9"
asdf.encoding # => #<Encoding:ISO-8859-1>
asdf.encode('UTF-8') # => "é"
asdf.encode('UTF-8').encoding # => #<Encoding:UTF-8>
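The same conversion can be applied while reading the file, rather than to a string literal. The following is a minimal sketch, assuming the input really is Windows-1252; `read_lines_as_utf8` is a hypothetical helper name, not something from the question. The `"external:internal"` encoding pair tells Ruby how the bytes on disk are encoded and what to transcode them to.

```ruby
# Hypothetical helper: read a file whose bytes are in source_encoding
# (assumed to be Windows-1252 here) and return its lines transcoded to UTF-8.
def read_lines_as_utf8(path, source_encoding = 'Windows-1252')
  # "external:internal" -- decode from source_encoding, hand back UTF-8 strings
  File.readlines(path, :encoding => "#{source_encoding}:UTF-8")
end
```

Lines read this way carry valid UTF-8, so calling .match(...) on them should no longer raise the invalid byte sequence error, provided the assumed source encoding is the right one.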
James Gray did a series of articles a couple years ago about dealing with this stuff. It's good reading.
Now, back to figuring out which character set a character could be in: when you only have one character it is difficult to tell, because the same byte can be valid in several sets at once. If you have more characters >= "\x80", you can run through the character sets iconv supports and try converting them. That's messy, but I had to do that in Perl for some screen scraping about five years ago. An alternative is to use the Python chardet code.
James Gray's articles have a link to an article recommending rchardet.
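The "try converting" idea can be sketched with Ruby 1.9's own String methods instead of iconv. This is only an illustration under stated assumptions: the candidate list and the name `to_utf8_guessing` are hypothetical, and the order matters, because single-byte sets such as ISO-8859-1 accept almost any byte and so can never be ruled out by a failed conversion. UTF-8 goes first since random non-UTF-8 text rarely happens to be valid UTF-8.

```ruby
# Assumed candidate list; adjust to the encodings your sources are likely to use.
CANDIDATE_ENCODINGS = ['UTF-8', 'Windows-1252', 'ISO-8859-1']

# Hypothetical helper: return raw as UTF-8 using the first candidate encoding
# under which its bytes are valid and convertible, or nil if none fits.
def to_utf8_guessing(raw, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |name|
    # Re-tag a copy of the bytes; force_encoding does not change the bytes.
    attempt = raw.dup.force_encoding(name)
    next unless attempt.valid_encoding?
    begin
      return attempt.encode('UTF-8')
    rescue EncodingError
      # A byte has no mapping in this character set; try the next candidate.
      next
    end
  end
  nil
end
```

This is a guess, not detection: a successful conversion only shows the bytes are legal in that set, which is why statistical detectors such as chardet exist.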
The above routines mention Mozilla's Charset Detectors, which will give you more info on dealing with this.
Upvotes: 3
Reputation: 619
You could try the conversion on the console first; this may be a hint.
I do it with a system command like this:
iconv -f windows-1252 -t UTF-8 "#{csv_file}" > #{Rails.root}/tmp/Kdvakanz-utf8.csv
Upvotes: 2