jcollum

Reputation: 46589

What is the best way to reliably remove Unicode from strings?

I have a variety of strings that I need to strip the 'TM', '(c)', etc. marks from. These marks are Unicode characters. Right now I just want to strip out all of the Unicode (non-ASCII) characters; once I get that working, I'll be more selective and remove only the legal marks. Here's the code:

strings = ['Star Wars \u2122 2', 'Empire Strikes Back\u00C2\u00AE The Strikening',
       "Star Wars\u2122 2", "Empire Strikes Back\u00C2\\u00AE The Strikening"]

p strings

strings.each { |str|
  sub = str.gsub(/\\(u[(\d)a-fA-F]{4})/, "")
  p sub
}

Results in:

["Star Wars \\u2122 2", "Empire Strikes Back\\u00C2\\u00AE The Strikening", "Star Wars\u2122 2", "Empire Strikes Back\u00C2\\u00AE The Strikening"]
"Star Wars  2"
"Empire Strikes Back The Strikening"
"Star Wars\u2122 2"
"Empire Strikes Back\u00C2 The Strikening"

This works for the single-quoted strings, but not for the double-quoted ones. I understand that single-quoted strings behave differently from double-quoted strings. The issue is that the strings being fed into this function behave like double-quoted strings, so the substitution doesn't happen. I tried adding sub('\\', '\\\\') before the gsub, but that didn't fix it.

I think I'm misunderstanding something about how strings behave in Ruby. How can I reliably remove a Unicode symbol from "Star Wars\u2122 2"? The regex that I have isn't doing it.

Ruby 1.9.3

Upvotes: 1

Views: 1993

Answers (2)

David Grayson

Reputation: 87406

This might be a little inefficient because it builds an array with every character in it, but it will work (in Ruby 1.9 and later):

s = "Empire Strikes Back\u00C2\u00AE The Strikening"
t = s.chars.select(&:ascii_only?).join    # => "Empire Strikes Back The Strikening"
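
If building the intermediate array is a concern, two alternatives that avoid it are sketched below. This is an illustration, not a benchmark: it assumes the input is a valid UTF-8 string, and the variable names are made up for the example.

```ruby
s = "Empire Strikes Back\u00C2\u00AE The Strikening"

# Remove every character outside the 7-bit ASCII range in a single pass.
by_regex = s.gsub(/[^\x00-\x7F]/, "")

# Transcode to US-ASCII, replacing anything that cannot be represented
# with an empty string. The result is tagged as US-ASCII, not UTF-8.
by_encode = s.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")

puts by_regex    # Empire Strikes Back The Strikening
puts by_encode   # Empire Strikes Back The Strikening
```

Both should produce the same text as the chars/select version; the gsub form keeps the original encoding, which is usually the less surprising choice.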

When you write '\u00C2' you are not creating a string that contains Unicode. You are creating a string with 6 ASCII characters: a backslash, the letter u, and four hex digits. When you write "\u00C2" you are creating a string with a single non-ASCII Unicode character. That's one of the differences between double-quote notation and single-quote notation.
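
The difference is easy to check in irb. The sketch below uses one of the strings from the question; the pattern is a simplified version of the question's regex, and the variable names are only for illustration.

```ruby
single = 'Star Wars\u2122 2'   # backslash, "u" and four hex digits stay literal
double = "Star Wars\u2122 2"   # \u2122 becomes one character, U+2122

puts single.length        # 17
puts double.length        # 12
puts single.ascii_only?   # true
puts double.ascii_only?   # false

# A pattern that looks for a literal backslash followed by "u" and four
# hex digits can only match the single-quoted form.
escape_pattern = /\\u[\da-fA-F]{4}/
stripped_single = single.gsub(escape_pattern, "")
stripped_double = double.gsub(escape_pattern, "")

puts stripped_single             # Star Wars 2
puts stripped_double == double   # true (nothing was removed)
```

This is why the regex in the question appears to work for single quotes only: in the double-quoted strings there is no backslash left to match by the time the regex runs.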

Upvotes: 6

Marnen Laibow-Koser

Reputation: 6337

Just do 'String with ™ and ®'.delete '™®'.
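
A minimal sketch of the same idea applied to one of the question's strings follows. The constant name LEGAL_MARKS is hypothetical, and the marks are written as \u escapes so the source file stays pure ASCII (on Ruby 1.9, a file containing literal ™ or ® characters needs an encoding magic comment).

```ruby
# U+2122 is the trade mark sign, U+00AE the registered sign,
# U+00A9 the copyright sign.
LEGAL_MARKS = "\u2122\u00AE\u00A9"   # hypothetical name, for illustration only

title = "Star Wars\u2122 2"
clean = title.delete(LEGAL_MARKS)    # removes only the listed characters

puts clean   # Star Wars 2
```

Unlike stripping everything non-ASCII, this leaves other characters, such as accented letters, untouched.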

Also, what's your use case for removing non-ASCII characters? Unless you're doing something like building a URL slug, this is probably not a great idea to begin with. If you are building a URL slug, there are lots of gems (such as friendly_id) that will do this for you.

Upvotes: 1
