How can I replace UTF-8 errors in Ruby without converting to a different encoding?

Question

To convert a string to UTF-8 and replace any invalid byte sequences, you can call:

str.encode('utf-8', :invalid=>:replace)

The problem is that this has no effect if str is already UTF-8, in which case any invalid bytes remain:

irb> x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
=> "foo\x92bar"
irb> x.valid_encoding?
=> false

To quote the Ruby Docs:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

The obvious workaround is to first convert to a different Unicode encoding and then back to UTF-8:

str.encode('utf-16', :invalid=>:replace).encode('utf-8')

For example:

irb> x = "foo\x92bar".encode('utf-16', :invalid=>:replace).encode('utf-8')
=> "foo�bar"
irb> x.valid_encoding?
=> true

Is there a better way to do this without converting to a dummy encoding?

matt · Accepted Answer

Ruby 2.1 has added a String#scrub method that does what you want:

2.1.0dev :001 > x = "foo\x92bar"
 => "foo\x92bar" 
2.1.0dev :002 > x.valid_encoding?
 => false 
2.1.0dev :003 > y = x.scrub
 => "foo�bar" 
2.1.0dev :004 > y.valid_encoding?
 => true
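As a minimal sketch of the other ways scrub can be called (assuming Ruby 2.1 or later; the variable names here are only illustrative), it also accepts a replacement string or a block that receives the invalid bytes:

```ruby
# Build a UTF-8 tagged string that contains the invalid byte 0x92.
raw = "foo\x92bar".dup.force_encoding(Encoding::UTF_8)

# No argument: each invalid sequence becomes U+FFFD.
default_fix = raw.scrub

# String argument: each invalid sequence becomes that string.
custom_fix = raw.scrub("?")            # => "foo?bar"

# Block form: the block receives the invalid bytes and returns the replacement.
hex_fix = raw.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }   # => "foo<92>bar"

puts default_fix.valid_encoding?       # true
puts custom_fix
puts hex_fix
```

scrub returns a new string and leaves the receiver untouched; scrub! modifies the receiver in place.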

The same commit also changes the behaviour of encode so that :invalid=>:replace takes effect even when the source and destination encodings are the same:

2.1.0dev :005 > x = "foo\x92bar".encode('utf-8', :invalid=>:replace)
 => "foo�bar" 
2.1.0dev :006 > x.valid_encoding?
 => true

As far as I know, there is no built-in way to do this before 2.1 (otherwise scrub wouldn’t be needed), so you’ll need to use a workaround such as the UTF-16 round trip until 2.1 is released and you can upgrade.
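One way to bridge the gap, sketched here under the assumption that the input is already tagged as UTF-8, is a small helper that uses scrub when it exists and otherwise falls back to the UTF-16 round trip from the question. The name replace_invalid_utf8 is hypothetical and not part of Ruby:

```ruby
# Hypothetical helper: replace invalid byte sequences in a UTF-8 tagged string.
def replace_invalid_utf8(str)
  # Ruby 2.1+ provides String#scrub, so prefer it when it is defined.
  return str.scrub if str.respond_to?(:scrub)

  # Older Rubies: converting UTF-8 to UTF-8 is a no-op, so round trip
  # through UTF-16 to force the invalid bytes to be replaced.
  str.encode('UTF-16', :invalid => :replace).encode('UTF-8')
end

broken = "foo\x92bar".dup.force_encoding(Encoding::UTF_8)
fixed  = replace_invalid_utf8(broken)

puts fixed.valid_encoding?   # true
```

Checking respond_to?(:scrub) rather than RUBY_VERSION keeps the helper working if a backport defines scrub on an older interpreter.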
