tigrou

Reputation: 4516

How to extract consonant/vowel groups from a word?

I would like to write a regex that splits a word into cvc (consonant/vowel/consonant) or vcv groups. It is similar to n-grams, but built from runs of vowels and consonants. Here is an example:

helloworld

would produce the following groups:

hell
ello
llow
owo
world

I have written the following regex:

(?=(([aeiouy]+|[^aeiouy]+){3}))

The first part, ([aeiouy]+|[^aeiouy]+){3}, captures either a vcv or a cvc group; the surrounding (?=( )) is a positive lookahead assertion. It doesn't work as expected:

hell
ello
llow
low //owo expected

Upvotes: 0

Views: 577

Answers (2)

AndreyS Scherbakov

Reputation: 2788

Because the whole pattern sits inside the lookahead, nothing is consumed, so the engine is free to start a match at any character, including one in the middle of a run. Match one run explicitly and put the next two V/C runs in a lookahead instead:

import re
r = re.compile(r'(?:([aeiouy]+)(?=([^aeiouy]+[aeiouy]+)))|(?:([^aeiouy]+)(?=([aeiouy]+[^aeiouy]+)))')

Then simply concatenate the groups of each match:

[''.join(groups) for groups in r.findall("Helloworld")]
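For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The helper name `cv_groups` and the lowercasing step are my own assumptions, not part of the answer; the pattern itself is the one given above.

```python
import re

# Same alternation as in the answer: one consumed run plus a two-run lookahead.
CV_PATTERN = re.compile(
    r'(?:([aeiouy]+)(?=([^aeiouy]+[aeiouy]+)))'
    r'|(?:([^aeiouy]+)(?=([aeiouy]+[^aeiouy]+)))'
)


def cv_groups(word):
    """Return the vcv/cvc groups of `word` (hypothetical helper name)."""
    # Lowercasing is an added assumption so that capital vowels count as vowels.
    # findall yields 4-tuples; groups from the branch that did not match are ''.
    return [''.join(parts) for parts in CV_PATTERN.findall(word.lower())]


print(cv_groups("Helloworld"))  # ['hell', 'ello', 'llow', 'owo', 'world']
```

Note that `[^aeiouy]` matches anything that is not a listed vowel, so digits or punctuation in the input would be treated as consonants.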

Upvotes: 0

Casimir et Hippolyte

Reputation: 89574

If you use a lookahead alone, no characters are consumed, so the regex engine tries every position in the string (in other words, it cannot advance by more than one character at a time).
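A small sketch that makes this visible in Python, using the pattern from the question with the inner group made non-capturing (my change, only to keep the output simple). It records the start offset of every zero-width match.

```python
import re

# The question's pattern: three runs, all inside a lookahead, nothing consumed.
LOOKAHEAD_ONLY = re.compile(r'(?=((?:[aeiouy]+|[^aeiouy]+){3}))')

# Every match is zero-width, so the next attempt starts one character later.
hits = [(m.start(), m.group(1)) for m in LOOKAHEAD_ONLY.finditer("helloworld")]

for start, group in hits:
    print(start, group)
```

The match at offset 3 is `low`, which starts in the middle of the `ll` run. The tail of the word also yields `orld` and `rld`, because backtracking can split the final `rld` run into shorter pieces.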

You can solve the problem like this:

(?=((?:[aeiou]+|[b-dfghj-np-tv-z]+){3}))(?:[aeiou]+|[b-dfghj-np-tv-z]+)

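Now the leading run of vowels (or consonants) is consumed by each match, outside of the lookahead, so the next attempt starts at the following run. Note that these character classes treat y as a consonant, whereas the question counts it as a vowel.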
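The same "advance one run at a time" idea can also be written without any lookahead. This is a different technique from the regex above, offered only as a sketch: split the word into maximal runs first, then join each window of three consecutive runs. The names `run_groups` and `VOWELS` are hypothetical, and treating y as a vowel follows the question rather than this answer.

```python
import re

VOWELS = "aeiouy"  # assumption: 'y' counts as a vowel, as in the question


def run_groups(word, size=3):
    """Return every window of `size` consecutive vowel/consonant runs."""
    # Split into maximal runs first, so a run can never be cut in half.
    runs = re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)
    # Slide a window of `size` runs across the list and join each window.
    return ["".join(runs[i:i + size]) for i in range(len(runs) - size + 1)]


print(run_groups("helloworld"))  # ['hell', 'ello', 'llow', 'owo', 'world']
```

Because the runs are fixed before any grouping happens, this version sidesteps backtracking inside a run, and words with fewer than three runs simply give an empty list.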

Upvotes: 1
