Reputation: 22113
To convert a string to UTF-8 and replace all encoding errors, you can do:
str.encode('utf-8', :invalid=>:replace)
The only problem with this is it doesn't work if str is already UTF-8, in which case any errors remain:
irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false
To quote the Ruby Docs:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:
str.encode('utf-16', :invalid=>:replace).encode('utf-8')
For example:
irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true
Is there a better way to do this without converting to a dummy encoding?
Upvotes: 13
Views: 5037
Reputation: 79723
Ruby 2.1 has added a String#scrub method that does what you want:
2.1.0dev :001 > x = "foo\x92bar"
=> "foo\x92bar"
2.1.0dev :002 > x.valid_encoding?
=> false
2.1.0dev :003 > y = x.scrub
=> "foo�bar"
2.1.0dev :004 > y.valid_encoding?
=> true
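If the default replacement character is not what you want, scrub also accepts a replacement string or a block. A minimal sketch, assuming Ruby 2.1 or later (the variable names are only for illustration):

```ruby
raw = "foo\x92bar"

# Pass a replacement string instead of the default U+FFFD.
with_question_mark = raw.scrub("?")

# Pass a block to build the replacement from the invalid bytes themselves.
with_hex = raw.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

# scrub! modifies the receiver in place.
in_place = raw.dup
in_place.scrub!

puts with_question_mark   # foo?bar
puts with_hex             # foo<92>bar
```

The block form is handy when you want to log or inspect what was dropped rather than silently replace it.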
The same commit also changes the behaviour of encode so that it works when the source and dest encodings are the same:
2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo�bar"
2.1.0dev :006 > x.valid_encoding?
=> true
As far as I know there is no built-in way to do this before 2.1 (otherwise scrub wouldn’t be needed), so you’ll need to use some workaround technique until 2.1 is released and you can upgrade.
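One way to bridge the gap is a small wrapper that uses scrub when it exists and otherwise falls back to the UTF-16 round trip from the question. This is only a sketch: the method name to_valid_utf8 is made up for illustration, and it assumes the input string is already tagged as UTF-8.

```ruby
# Hypothetical helper; the name is an assumption, not part of Ruby's API.
# Assumes str is tagged as UTF-8 but may contain invalid bytes.
def to_valid_utf8(str)
  if str.respond_to?(:scrub)
    # Ruby 2.1+: invalid byte sequences become U+FFFD by default.
    str.scrub
  else
    # Older Rubies: round-trip through UTF-16, as in the question.
    str.encode('utf-16', :invalid => :replace).encode('utf-8')
  end
end

cleaned = to_valid_utf8("foo\x92bar")
puts cleaned.valid_encoding?   # true
```

Once everything runs on 2.1 or later, the helper can be replaced by a plain call to scrub.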
Upvotes: 21
Reputation: 8003
Try this:
"foo\x92bar".chars.select(&:valid_encoding?).join
# => "foobar"
Or, to replace each invalid character with a placeholder instead of dropping it:
"foo\x92bar".chars.map{|c| c.valid_encoding? ? c : "?"}.join
# => "foo?bar"
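The same idea can be wrapped in a method so the placeholder is configurable. A rough sketch; the method name and default placeholder are assumptions chosen for the example:

```ruby
# Hypothetical helper built on the chars approach above.
# Each character that fails valid_encoding? is swapped for `placeholder`;
# pass an empty string to drop invalid characters entirely.
def replace_invalid_chars(str, placeholder = "?")
  str.chars.map { |c| c.valid_encoding? ? c : placeholder }.join
end

puts replace_invalid_chars("foo\x92bar")       # foo?bar
puts replace_invalid_chars("foo\x92bar", "")   # foobar
```

Note that this walks the string one character at a time, so it is likely slower than a built-in method on large inputs.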
Upvotes: 6