Julien Genestoux

Reputation: 32972

Ruby Regular Expression to match words, including accents and other UTF8 characters

We're trying to find a regular expression that lets us split sentences into words. The obvious choice is \w, but it treats _ as a word character, and we need to split on it. We then tried [a-zA-Z0-9] (we'd like to allow digits inside words), but that splits on accented characters, which are fairly common in many languages...

So, ideally, what regexp should I use to split this sentence:

"Je ne déguste pas d'asperges, car je n'aime pas ça"

into

["Je","ne","déguste","pas","d", "asperges", "car","je", "n","aime","pas", "ça"]

Upvotes: 3

Views: 1848

Answers (1)

Brent Newey

Reputation: 4509

str = "Je ne déguste pas d'asperges, car je n'aime pas ça"
# Split on runs of whitespace, commas, and apostrophes
words = str.split(/[\s,']+/)
words.each { |w| puts w }

The output is:

Je
ne
déguste
pas
d
asperges
car
je
n
aime
pas
ça
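
The split above lists its separators explicitly, so it would not split on _ or on other punctuation. As a possible alternative, here is a minimal sketch that matches the words themselves using Unicode property classes. It assumes Ruby 1.9 or later and a UTF-8 encoded string, and `split_words` is a hypothetical helper name, not something from the question:

```ruby
# encoding: utf-8

# Hypothetical helper: collect runs of Unicode letters and digits.
# \p{L} matches any letter (accented ones included) and \p{N} any
# numeric character. The underscore is in neither class, so it
# separates words just like spaces, commas, and apostrophes do.
def split_words(text)
  text.scan(/[\p{L}\p{N}]+/)
end

sentence = "Je ne déguste pas d'asperges, car je n'aime pas ça"
puts split_words(sentence)
```

With this approach anything that is not a letter or digit acts as a separator, so there is no list of punctuation to maintain.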

Upvotes: 3
