Matt

Reputation: 22113

How can I replace UTF-8 errors in Ruby without converting to a different encoding?

In order to convert a string to UTF-8 and replace all encoding errors, you can do:

str.encode('utf-8', :invalid=>:replace)

The problem is that this does nothing if str is already UTF-8, so any invalid bytes remain:

irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false

To quote the Ruby Docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:

str.encode('utf-16', :invalid=>:replace).encode('utf-8')

For example:

irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true

Is there a better way to do this without converting to a dummy encoding?

Upvotes: 13

Views: 5037

Answers (2)

matt

Reputation: 79723

Ruby 2.1 has added a String#scrub method that does what you want:

2.1.0dev :001 > x = "foo\x92bar"
 => "foo\x92bar" 
2.1.0dev :002 > x.valid_encoding?
 => false 
2.1.0dev :003 > y = x.scrub
 => "foo�bar" 
2.1.0dev :004 > y.valid_encoding?
 => true 

The same commit also changes the behaviour of encode so that it replaces invalid bytes even when the source and destination encodings are the same:

2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
 => "foo�bar" 
2.1.0dev :006 > x.valid_encoding?
 => true 

As far as I know, there is no built-in way to do this before 2.1 (otherwise scrub wouldn't be needed), so you'll need a workaround until 2.1 is released and you can upgrade.
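
If the same code has to run on both older and newer Rubies, one possible stopgap is a small helper that calls scrub when it exists and otherwise falls back to the UTF-16 round trip from the question. This is only a sketch: scrub_utf8 is a made-up name, and it assumes the input string is tagged as UTF-8.

```ruby
# Hypothetical helper (the name scrub_utf8 is an assumption, not a Ruby API).
# Returns a copy of str with invalid byte sequences replaced by U+FFFD.
def scrub_utf8(str)
  if str.respond_to?(:scrub)
    # Ruby 2.1 and later: String#scrub handles this directly.
    str.scrub
  else
    # Older Rubies: round-trip through UTF-16, as shown in the question.
    str.encode('utf-16', :invalid => :replace).encode('utf-8')
  end
end

cleaned = scrub_utf8("foo\x92bar")
puts cleaned.valid_encoding?
```

Once everything runs on 2.1 or later, the helper can be dropped in favour of calling scrub directly.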

Upvotes: 21

tihom

Reputation: 8003

Try this:

"foo\x92bar".chars.select(&:valid_encoding?).join
# => "foobar"

Or, to replace each invalid character with a placeholder:

"foo\x92bar".chars.map { |c| c.valid_encoding? ? c : "?" }.join
# => "foo?bar"
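
For comparison, on Ruby 2.1 or later (where String#scrub is available) the same two results can be had without splitting into characters. The sketch below assumes the string is tagged as UTF-8; the variable names are only for illustration.

```ruby
s = "foo\x92bar"

# Passing an empty string drops the invalid bytes entirely.
removed = s.scrub("")

# Passing a replacement string substitutes it for each invalid sequence.
replaced = s.scrub("?")

# The block form receives the invalid bytes, e.g. to show them as hex.
tagged = s.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts removed, replaced, tagged
```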

Upvotes: 6
