Reputation: 3576
I have a large file with two different encodings. The "main" file is UTF-8, but some characters like <80> (€ in isoxxx) or <9F> (ß in isoxxx) are in ISO-8859-1 encoding. I can use this to replace the invalid characters:
string.encode("iso8859-1", "utf-8", {:invalid => :replace, :replace => "-"}).encode("utf-8")
The problem is that I need these wrongly encoded characters, so replacing them with "-" is useless for me. How can I fix the wrongly encoded characters in the document with Ruby?
EDIT: I've tried the :fallback option, but with no success (no replacements were made):
string.encode("iso8859-1", "utf-8",
  :fallback => {"\x80" => "123"}
)
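A likely explanation for the failed attempt above, offered here as an assumption rather than a verified diagnosis: :fallback is consulted only for characters that are undefined in the destination encoding, while bytes that are invalid in the source encoding are governed by the :invalid option, so a fallback hash keyed on "\x80" is never consulted. A minimal sketch of where :fallback does fire:

```ruby
# :fallback fires only when a character has no representation in the
# destination encoding; "€" is undefined in ISO-8859-1, so the
# fallback value is substituted.
puts "€".encode("iso8859-1", :fallback => { "€" => "EUR" }) #=> EUR
```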
Upvotes: 3
Views: 527
Reputation: 5545
Here is a much faster version of my previous code, compatible with Ruby 1.8 and 1.9. It identifies the invalid UTF-8 characters with a regexp and converts only those.
class String
  # Regexp for invalid UTF-8 chars:
  # $1 will be the valid UTF-8 sequence,
  # $3 will be the invalid UTF-8 char.
  INVALID_UTF8 = Regexp.new(
    '(([\xc0-\xdf][\x80-\xbf]{1}|' +
    '[\xe0-\xef][\x80-\xbf]{2}|' +
    '[\xf0-\xf7][\x80-\xbf]{3}|' +
    '[\xf8-\xfb][\x80-\xbf]{4}|' +
    '[\xfc-\xfd][\x80-\xbf]{5})*)' +
    '([\x80-\xff]?)', nil, 'n')

  if RUBY_VERSION >= '1.9'
    # Ensure each char is UTF-8, assuming that
    # bad characters are in the +encoding+ encoding.
    def utf8_ignore!(encoding)
      # Avoid bad-character errors and encoding incompatibilities.
      force_encoding('ascii-8bit')
      # Encode only the invalid UTF-8 chars within the string.
      gsub!(INVALID_UTF8) do |s|
        $1 + $3.force_encoding(encoding).encode('utf-8').force_encoding('ascii-8bit')
      end
      # The final string is UTF-8.
      force_encoding('utf-8')
    end
  else
    require 'iconv'

    # Ensure each char is UTF-8, assuming that
    # bad characters are in the +encoding+ encoding.
    def utf8_ignore!(encoding)
      # Encode only the invalid UTF-8 chars within the string.
      gsub!(INVALID_UTF8) do |s|
        $1 + Iconv.conv('utf-8', encoding, $3)
      end
    end
  end
end
# "\xe3" is "ã" in ISO-8859-1.
# Mix valid UTF-8 chars with an invalid char, which is in ISO-8859-1.
a = "ãb\xe3"
a.utf8_ignore!('iso-8859-1')
puts a #=> ãbã
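For readers on Ruby 2.1 or later, the same "convert only the invalid bytes" idea can be sketched with String#scrub, which yields each invalid byte run to a block (a modern alternative added here, not part of the original answer):

```ruby
# Ruby >= 2.1: String#scrub yields each invalid byte run, so stray
# ISO-8859-1 bytes can be re-encoded individually while valid UTF-8
# passes through untouched.
mixed = "\u00E3b\xE3"   # "ã" (valid UTF-8) + "b" + ISO-8859-1 "ã"
fixed = mixed.scrub { |bad| bad.encode("utf-8", "iso-8859-1") }
puts fixed #=> ãbã
```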
Upvotes: 1
Reputation: 5545
I used the following code (Ruby 1.8.7). It tests each byte >= 128 to check whether it begins a valid UTF-8 sequence. If not, the byte is assumed to be ISO-8859-1 and is converted to UTF-8.
Because your file is large, this procedure can be very slow!
class String
  # Grants that each char in the final string is UTF-8-compliant.
  # Based on http://php.net/manual/en/function.utf8-encode.php#39986
  def utf8
    ret = ''
    # Scan the string byte by byte.
    # I'd use self.each_byte do |b|, but I'll need to change i.
    a = self.unpack('C*')
    i = 0
    l = a.length
    while i < l
      b = a[i]
      i += 1
      # If it's ASCII, don't do anything.
      if b < 0x80
        ret += b.chr
        next
      end
      # Check whether it's the beginning of a valid UTF-8 sequence.
      m = [0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe]
      n = 0
      # Note: >= keeps m[n] in bounds for bytes matching no lead mask
      # (e.g. stray continuation bytes or 0xFE/0xFF).
      n += 1 until n >= m.length || (b & m[n]) == m[n - 1]
      # If not, convert it to UTF-8.
      if n >= m.length
        ret += [b].pack('U')
        next
      end
      # If yes, check whether the rest of the sequence is UTF-8 too:
      # do n bytes matching 10bbbbbb follow?
      r = [b]
      u = false
      n.times do
        if i < l
          r << a[i]
          u = (a[i] & 0xc0) == 0x80
          i += 1
        else
          u = false
        end
        break unless u
      end
      # If not, convert it!
      ret += r.pack(u ? 'C*' : 'U*')
    end
    ret
  end

  def utf8!
    replace utf8
  end
end

# Let s be the string containing your file.
s2 = s.utf8
# or
s.utf8!
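The pack('U') conversion step works because ISO-8859-1 maps byte-for-byte onto the first 256 Unicode code points, so packing a Latin-1 byte value as a code point yields its UTF-8 encoding directly. A minimal check (my addition, not part of the original answer):

```ruby
# 0xE3 is "ã" in ISO-8859-1, and its Unicode code point is also U+00E3,
# so [0xE3].pack('U') produces the UTF-8 bytes C3 A3.
converted = [0xE3].pack('U')
puts converted                                                       #=> ã
puts converted.unpack('C*').map { |b| format('%02X', b) }.join(' ')  #=> C3 A3
```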
Upvotes: 1
Reputation: 18835
Are you looking for something like this?
http://jalada.co.uk/2011/12/07/solving-latin1-and-utf8-errors-for-good-in-ruby.html
Upvotes: 0