Reputation: 545
I need to remove characters such as "+", "/", "_" and similar from strings before running a search on them.
Following another question here, I did this with the gsub method. The problem is that it also strips accented letters, which I don't want:
string.gsub(/[^0-9A-Za-z]/, '')
EDIT: The languages I need to support are Spanish and Catalan.
Is there any way to adapt the expression so that it preserves accented letters?
Upvotes: 0
Views: 114
Reputation: 121000
Both answers given here so far are plain wrong.
There are two ways to represent accented letters in modern Unicode: composed (a single precomposed code point) and decomposed (a base letter followed by combining diacritics). With Ruby 2.3+ everything is easy:
"Barça".unicode_normalize(:nfc).scan(/\p{L}/)
#⇒ ["B", "a", "r", "ç", "a"]
The above will work no matter how “ç” was constructed, whether as a Latin1 composed character or as a base letter with a combining diacritic.
That said, to remove all non-letter characters, one would do:
"Barça".unicode_normalize(:nfc).gsub(/[^\p{L}]/, '')
Before Ruby 2.3 there was no standard way to normalize a string to composed form, and while the simple range À..ÿ would work for “mañana” (composed form, ñ as the single code point U+00F1), it won’t for “mañana” (combined diacritics, n followed by the combining tilde U+0303). You can confirm that the two differ by copy-pasting both into your irb shell.
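A minimal sketch of the difference described above. The two spellings are written with \u escapes so the code points are explicit, and letters_only is a hypothetical helper name introduced only for this illustration:

```ruby
# Two spellings of the same word, written with escapes so the difference is visible.
COMPOSED   = "ma\u00F1ana"   # ñ as one code point (U+00F1)
DECOMPOSED = "man\u0303ana"  # n followed by the combining tilde (U+0303)

# Hypothetical helper: normalize to NFC first, then drop everything that is not a letter.
def letters_only(str)
  str.unicode_normalize(:nfc).gsub(/[^\p{L}]/, '')
end

puts COMPOSED == DECOMPOSED                        # false: different code points
puts DECOMPOSED.gsub(/[^A-Za-z\u00C0-\u00FF]/, '') # manana (the combining tilde is stripped)
puts letters_only(DECOMPOSED) == COMPOSED          # true: NFC merges n + tilde into ñ
puts letters_only("+Bar\u00E7a/_")                 # Barça
```

Normalizing before the gsub is what makes the result independent of how the input was typed or encoded.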
Upvotes: 4
Reputation: 714
You can also use a POSIX bracket expression. You will find all the documentation you need in the Ruby docs.
In your case you can use either:
string.gsub(/[^[:alpha:]]/, '')
or:
string.gsub(/[^[:alnum:]]/, '')
From the documentation:
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
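As a rough illustration of how the two bracket expressions differ on Spanish and Catalan input, here is a small sketch. The helper names keep_letters and keep_alnum are hypothetical, and the sample strings use precomposed accented letters only:

```ruby
# Hypothetical helpers wrapping the two expressions from this answer.
def keep_letters(str)
  str.gsub(/[^[:alpha:]]/, '')   # keeps letters only
end

def keep_alnum(str)
  str.gsub(/[^[:alnum:]]/, '')   # keeps letters and digits
end

puts keep_letters("ma\u00F1ana+tarde")    # mañanatarde
puts keep_letters("can\u00E7\u00F3_2/3")  # cançó
puts keep_alnum("can\u00E7\u00F3_2/3")    # cançó23
```

Which of the two fits depends on whether digits should survive in the search string.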
Upvotes: 1
Reputation: 11807
Borrowing from answers to this question, the regex character range for many, but not all, accented characters is À-ÿ. Therefore, to match these too, you can simply add this to your existing ranges:
string.gsub(/[^0-9A-Za-zÀ-ÿ]/, '')
It largely depends on which accents you need, since there are too many accented characters to match them all with one simple range. This example regex preserves, for instance, acute and grave accents, but misses carons (háčeks):
puts "I went to a café.".gsub(/[^0-9A-Za-zÀ-ÿ]/, '') # Iwenttoacafé
puts "Ahoj, světe!".gsub(/[^0-9A-Za-zÀ-ÿ]/, '') # Ahojsvte
This might be fine for your use case, but if you're dealing with, say, Czech text, you might need additional character ranges to match carons.
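One possible way to extend the range, sketched under the assumption that the Latin Extended-A block (U+0100 to U+017F) covers the extra letters you need. The constant names are made up for this example. It also shows a side effect of À-ÿ: the range contains the non-letter signs × (U+00D7) and ÷ (U+00F7), so those are preserved too.

```ruby
# The range from this answer, written with escapes: À (U+00C0) through ÿ (U+00FF).
LATIN1_RANGE   = /[^0-9A-Za-z\u00C0-\u00FF]/
# Assumption: additionally allow Latin Extended-A (U+0100..U+017F), which includes ě.
EXTENDED_RANGE = /[^0-9A-Za-z\u00C0-\u00FF\u0100-\u017F]/

czech = "Ahoj, sv\u011Bte!"
puts czech.gsub(LATIN1_RANGE, '')        # Ahojsvte
puts czech.gsub(EXTENDED_RANGE, '')      # Ahojsvěte
puts "2\u00D73".gsub(LATIN1_RANGE, '')   # 2×3 (the multiplication sign is inside À-ÿ)
```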
Upvotes: 0