Ernesto G

Reputation: 545

Remove non-alphanumeric chars from string, preserving accented chars

I need to remove characters such as "+", "/", "_" and similar from strings before running a search.

Following another question here, I did this with the gsub method. The problem is that it also removes accented letters, which I don't want:

string.gsub(/[^0-9A-Za-z]/, '')

EDIT: The languages I need to support are Spanish and Catalan.

Is there any way to adapt the expression to preserve the letters with accents?

Upvotes: 0

Views: 114

Answers (3)

Aleksei Matiushkin

Reputation: 121000

Both answers given here so far are plain wrong.

There are two ways to represent accented letters in modern Unicode: precomposed characters, and base letters followed by combining diacritics (decomposed form). With Ruby 2.3+ everything is easy:

"Barça".unicode_normalize(:nfc).scan(/\p{L}/)
#⇒ ["B", "a", "r", "ç", "a"]

The above will work no matter how “ç” was constructed: as a single precomposed character (as in Latin-1) or as “c” followed by a combining cedilla.

That said, to remove everything that is not a letter (digits included), one would do:

"Barça".unicode_normalize(:nfc).gsub(/[^\p{L}]/, '')
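Since the question asks for alphanumeric characters, a variant that also keeps digits may be closer to what is needed. The following is only a sketch: the helper name `strip_non_alphanumeric` and the sample string are made up for illustration, and `\p{N}` (any Unicode number) is one possible choice for "numeric".

```ruby
# Hypothetical helper (the name is illustrative, not from the answer above).
# Normalizes to NFC first, then drops every character that is neither
# a Unicode letter (\p{L}) nor a Unicode number (\p{N}).
def strip_non_alphanumeric(text)
  text.unicode_normalize(:nfc).gsub(/[^\p{L}\p{N}]/, '')
end

# "\u00E7" is ç and "\u00E9" is é, written as escapes to make the code points explicit.
cleaned = strip_non_alphanumeric("Bar\u00E7a_2024 + m\u00E9s/que")
puts cleaned # Barça2024mésque
```

If only ASCII digits should survive, replacing `\p{N}` with `0-9` in the character class should be enough.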

Before Ruby 2.3 there was no standard way to normalize a string to composed form, and while for “mañana” the simple range À..ÿ would work (composed form,) for “mañana” it won’t (combined diacritics.) You might ensure there is a difference yourself by copy-pasting both into your irb shell.
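To make the difference visible without relying on copy-paste, here is a small sketch that builds both forms from explicit escapes. The variable names are illustrative; the range is written as `\u00C0-\u00FF`, which is the same as À-ÿ.

```ruby
composed   = "ma\u00F1ana"   # ñ as a single code point (U+00F1)
decomposed = "man\u0303ana"  # n followed by a combining tilde (U+0303)

# Same pattern as the simple range approach, with À-ÿ spelled as escapes.
range = /[^0-9A-Za-z\u00C0-\u00FF]/

stripped_composed   = composed.gsub(range, '')
stripped_decomposed = decomposed.gsub(range, '')
renormalized        = decomposed.unicode_normalize(:nfc)

puts composed == decomposed   # false
puts composed.length          # 6
puts decomposed.length        # 7
puts stripped_composed        # mañana
puts stripped_decomposed      # manana (the combining tilde was removed)
puts renormalized == composed # true
```

The two strings render identically but differ in length, and only the composed one survives the range-based gsub intact.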

Upvotes: 4

radoAngelov

Reputation: 714

You can also use a POSIX bracket expression. You will find all the documentation you need in the Ruby docs.

In your case you can use either:

string.gsub(/[^[:alpha:]]/, '')

or:

string.gsub(/[^[:alnum:]]/, '')

From the documentation:

/[[:alnum:]]/ - Alphabetic and numeric character

/[[:alpha:]]/ - Alphabetic character
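Ruby's POSIX bracket expressions are Unicode-aware, so for UTF-8 strings they should keep precomposed accented letters. A quick sketch with a made-up sample string (accented letters written as escapes so the code points are unambiguous):

```ruby
# "\u00FC" is ü, "\u00ED" is í, "\u00F1" is ñ (all precomposed).
sample = "Ling\u00FC\u00EDstica_catalana + a\u00F1o/2024"

letters_only        = sample.gsub(/[^[:alpha:]]/, '')
letters_and_numbers = sample.gsub(/[^[:alnum:]]/, '')

puts letters_only        # Lingüísticacatalanaaño
puts letters_and_numbers # Lingüísticacatalanaaño2024
```

Unlike `\w` or `[[:word:]]`, neither class includes the underscore, so "_" is removed as well.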

Upvotes: 1

Aaron Christiansen

Reputation: 11807

Borrowing from answers to this question, the regex character range for many, but not all, accented characters is À-ÿ. Therefore, to match these too, you can simply add this to your existing ranges:

string.gsub(/[^0-9A-Za-zÀ-ÿ]/, '')

It largely depends on the accents you're looking for, since there are too many accents to easily match all of them. This example regex will preserve, for instance, acute and grave accents, but misses carons (háčeks):

puts "I went to a café.".gsub(/[^0-9A-Za-zÀ-ÿ]/, '') # Iwenttoacafé
puts "Ahoj, světe!".gsub(/[^0-9A-Za-zÀ-ÿ]/, '')      # Ahojsvte

This might be fine for your use case, but if you're dealing with, say, Czech text, you might need additional character ranges to match letters with carons.
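For comparison, here is a sketch that runs the range-based pattern next to a Unicode property class. The variable names and the second sample string are illustrative. One thing worth knowing about the À-ÿ range is that it also spans × (U+00D7) and ÷ (U+00F7), which are symbols rather than letters.

```ruby
range_pattern    = /[^0-9A-Za-z\u00C0-\u00FF]/ # same as À-ÿ, written with escapes
property_pattern = /[^\p{L}\p{N}]/             # any Unicode letter or number

czech = "Ahoj, sv\u011Bte!" # "\u011B" is ě
czech_by_range    = czech.gsub(range_pattern, '')
czech_by_property = czech.gsub(property_pattern, '')
puts czech_by_range    # Ahojsvte
puts czech_by_property # Ahojsvěte

math = "3 \u00D7 4 \u00F7 2" # 3 × 4 ÷ 2
math_by_range    = math.gsub(range_pattern, '')
math_by_property = math.gsub(property_pattern, '')
puts math_by_range    # 3×4÷2
puts math_by_property # 342
```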

Upvotes: 0
