How to extract consonant/vowel groups from a word?

Question

I would like to write a regex that splits a word into cvc (consonant/vowel/consonant) or vcv groups. It is similar to n-grams, but built from runs of vowels and consonants. Here is an example:

helloworld

would produce the following groups:

hell
ello
llow
owo
world

I have written the following regex:

(?=(([aeiouy]+|[^aeiouy]+){3}))

The first part, ([aeiouy]+|[^aeiouy]+){3}, captures either a vcv or a cvc group; the rest, (?=( )), is a positive lookahead assertion. It doesn't work as expected:

hell
ello
llow
low //owo expected

Casimir et Hippolyte · Accepted Answer

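If you use a lookahead alone, no characters are consumed, so the regex engine tries every position in the string (in other words, it cannot advance by more than one character at a time). That is why a match can start in the middle of a run of consonants, as with low, which starts at the second l of ll.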
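
To make this concrete, here is a minimal Python sketch (Python is an assumption for illustration; the question does not name a language) that runs the lookahead-only pattern from the question and records the offset at which each match starts. The variable names are made up for the example.

```python
import re

# Pattern from the question: a lookahead only, so every match is zero-width.
lookahead_only = re.compile(r"(?=(([aeiouy]+|[^aeiouy]+){3}))")

# Record the start offset and the captured text of each match.
matches = [(m.start(), m.group(1)) for m in lookahead_only.finditer("helloworld")]

for start, text in matches:
    print(start, text)
# 0 hell
# 1 ello
# 2 llow
# 3 low    <- starts on the second "l", inside the "ll" run
# 4 owo
# 5 world
# 6 orld
# 7 rld
```

The engine stops at every offset from 0 to 7, including offset 3 inside ll. The last two results also show the trailing run rld being split by backtracking to make up three groups.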

You can solve the problem like this:

(?=((?:[aeiou]+|[b-dfghj-np-tv-z]+){3}))(?:[aeiou]+|[b-dfghj-np-tv-z]+)

Now the leading run of vowels (or consonants) is consumed by each match, outside the lookahead, so the next attempt starts at the beginning of the following run.
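
As a hedged illustration, the sketch below applies this pattern with Python's re module (an assumption; the answer does not name an engine). The names PATTERN, STRICT, RUN and cv_groups are hypothetical. In a backtracking engine the last run can still be split to make up three groups, so the pattern as written also returns orld and rld at the end of helloworld. The STRICT variant is one possible way to avoid that, not part of the original answer: it adds negative lookaheads so that a run may only end where the letter class changes.

```python
import re

V = "[aeiou]"
C = "[b-dfghj-np-tv-z]"

# Pattern from the answer: capture three runs in a lookahead, then consume one run.
PATTERN = re.compile(rf"(?=((?:{V}+|{C}+){{3}}))(?:{V}+|{C}+)")

# Assumed variant: a run must not be followed by another letter of the same class,
# which stops backtracking from cutting a run short.
RUN = rf"(?:{V}+(?!{V})|{C}+(?!{C}))"
STRICT = re.compile(rf"(?=({RUN}{{3}})){RUN}")


def cv_groups(word, pattern):
    """Return the text captured by group 1 for every match in word."""
    return [m.group(1) for m in pattern.finditer(word)]


print(cv_groups("helloworld", PATTERN))
# ['hell', 'ello', 'llow', 'owo', 'world', 'orld', 'rld']
print(cv_groups("helloworld", STRICT))
# ['hell', 'ello', 'llow', 'owo', 'world']
```

Note that this pattern counts y as a consonant, whereas the character class in the question treats it as a vowel; adjust the two classes if that matters for your data.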
