Reputation: 927
How do I delete non-UTF-8 characters from a Ruby string? I have a string that contains, for example, "\xC2". I want to remove that byte so that the string becomes valid UTF-8.
This:
text = x = "foo\xC2bar"
text.gsub!(/\xC2/, '')
returns an error:
incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)
I was looking at text.unpack('U*') and string.pack as well, but did not get anywhere.
Upvotes: 41
Views: 26849
Reputation: 1
Use the String#encode method with the :replace option to return a string without invalid characters:
'MyString'.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
or use the bang version to modify the string in place:
'MyString'.encode!('UTF-8', :invalid => :replace, :undef => :replace, :replace => '')
Upvotes: 0
Reputation: 10630
You can use encode for that.
text.encode('UTF-8', :invalid => :replace, :undef => :replace)
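Or use text.scrub (available since Ruby 2.1).
For more information, see the Ruby docs. By default, an invalid byte sequence is replaced with U+FFFD, the replacement character (usually rendered as a question mark in a box). Pass an empty string as the replacement to delete the bytes instead.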
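As a minimal sketch of how scrub behaves on the string from the question (assuming Ruby 2.1 or later; the variable names are only for illustration):

```ruby
# "\xC2" is a UTF-8 lead byte with no continuation byte after it,
# so this string is not valid UTF-8.
text = "foo\xC2bar"

# Default: each invalid sequence becomes U+FFFD (the replacement character).
replaced = text.scrub

# Pass an empty string to drop the invalid bytes instead.
removed = text.scrub('')

# A block receives the offending bytes and returns their replacement.
tagged = text.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts text.valid_encoding?      # false
puts replaced.valid_encoding?  # true
puts removed                   # foobar
puts tagged                    # foo<c2>bar
```

scrub returns a new string; scrub! modifies the receiver in place.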
Upvotes: 123
Reputation: 1410
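Your regexp has ASCII-8BIT encoding while your text is UTF-8. Instead, you could use this:
text.delete!("^\u{0000}-\u{007F}")
It serves a similar purpose, but note that it removes every character outside the ASCII range, including valid non-ASCII characters.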
Upvotes: 7
Reputation: 905
The best solution to this problem that I've found is this answer to the same question: https://stackoverflow.com/a/8711118/363293.
In short: "€foo\xA0".chars.select(&:valid_encoding?).join
Upvotes: 3
Reputation: 42192
You could do it like this
# encoding: utf-8
class String
  def validate_encoding
    chars.select(&:valid_encoding?).join
  end
end
puts "testing\xC2 a non UTF-8 string".validate_encoding
# => testing a non UTF-8 string
Upvotes: 11
Reputation: 9146
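Try Iconv (deprecated in Ruby 1.9.3 and removed from the standard library in Ruby 2.0, where it is only available as the iconv gem):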
1.9.3p194 :001 > require 'iconv'
# => true
1.9.3p194 :002 > string = "testing\xC2 a non UTF-8 string"
# => "testing\xC2 a non UTF-8 string"
1.9.3p194 :003 > ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
# => #<Iconv:0x000000026c9290>
1.9.3p194 :004 > ic.iconv string
# => "testing a non UTF-8 string"
Upvotes: 4
Reputation: 204768
You can use the /n flag, as in
text.gsub!(/\xC2/n, '')
to force the Regexp to operate on bytes.
Are you sure this is what you want, though? Any Unicode character in the range [U+80, U+BF] will have a \xC2 in its UTF-8 encoded form.
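To illustrate that caveat, here is a rough sketch that strips the byte from a binary copy of the string. It assumes Ruby 2.0 or later for String#b, and strip_byte_c2 is a hypothetical helper name, not part of Ruby:

```ruby
# Hypothetical helper: remove every 0xC2 byte, working on a binary
# (ASCII-8BIT) copy so the byte-oriented regexp and the string agree.
def strip_byte_c2(str)
  str.b.gsub(/\xC2/n, '').force_encoding(Encoding::UTF_8)
end

broken = "foo\xC2bar"
fixed = strip_byte_c2(broken)

# U+00A9 (the copyright sign) is encoded in UTF-8 as the bytes C2 A9.
copyright = "\u00A9 2012"
damaged = strip_byte_c2(copyright)  # leaves a lone \xA9 behind

puts fixed                    # foobar
puts fixed.valid_encoding?    # true
puts damaged.valid_encoding?  # false
```

The second case shows why removing a single byte value is risky: a string that was valid before becomes invalid afterwards.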
Upvotes: 5