Reputation: 4516
I would like to write a regex that splits a word into cvc (consonant/vowel/consonant) or vcv groups, where each consonant or vowel part is a run of one or more letters. It is similar to n-grams, but built from vowels and consonants. Here is an example:
helloworld
would produce the following groups:
hell
ello
llow
owo
world
I have written the following regex:
(?=(([aeiouy]+|[^aeiouy]+){3}))
The first part ([aeiouy]+|[^aeiouy]+){3}
captures either a vcv or a cvc group; the rest (?=( ))
is a positive lookahead assertion.
It doesn't work as expected:
hell
ello
llow
low //owo expected
Upvotes: 0
Views: 577
Reputation: 2788
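By putting the whole pattern inside the lookahead, you consume nothing, so the engine is free to start a match at every character, including the middle of a consonant or vowel run. Consume the first V/C sequence explicitly and put only the other two sequences in a lookahead instead: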
import re
r = re.compile(r'(?:([aeiouy]+)(?=([^aeiouy]+[aeiouy]+)))|(?:([^aeiouy]+)(?=([aeiouy]+[^aeiouy]+)))')
Then simply concatenate the groups of each match (unmatched groups come back as empty strings, so joining is safe):
[''.join(groups) for groups in r.findall("Helloworld")]
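Put together as a self-contained sketch. The function name `cv_groups` and the lowercasing step are additions for illustration, not part of the snippet above. Lowercasing is there because the character classes only list lowercase vowels, so an uppercase `E` would otherwise be treated as a consonant.

```python
import re

VOWELS = "[aeiouy]+"
CONSONANTS = "[^aeiouy]+"  # note: this also matches digits, spaces and punctuation

# Consume the first run, then look ahead for the two runs that must follow it.
pattern = re.compile(
    rf"(?:({VOWELS})(?=({CONSONANTS}{VOWELS})))"
    rf"|(?:({CONSONANTS})(?=({VOWELS}{CONSONANTS})))"
)

def cv_groups(word):
    """Return the overlapping vcv/cvc groups of `word` (hypothetical helper)."""
    # Each findall() item is a 4-tuple; the two groups that did not take part
    # in the match are empty strings, so joining the tuple gives the full group.
    return ["".join(groups) for groups in pattern.findall(word.lower())]

print(cv_groups("Helloworld"))
```

Because the first run is consumed, the next match can only start at the beginning of the following run, which is what rules out `low`.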
Upvotes: 0
Reputation: 89574
If you use a lookahead alone, no characters are consumed, so the regex engine retries at every position in the string (in other words, it cannot advance by more than one character at a time).
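To see this concretely, here is a small Python sketch that lists the start offset of every match of the lookahead-only regex from the question. Python is an assumption here, since the question does not name a language, and `starts_and_groups` is just an illustrative name.

```python
import re

# The lookahead-only regex from the question: it never consumes a character.
lookahead_only = re.compile(r"(?=(([aeiouy]+|[^aeiouy]+){3}))")

def starts_and_groups(word):
    """Return (start offset, outer captured group) for every match in `word`."""
    return [(m.start(), m.group(1)) for m in lookahead_only.finditer(word)]

for start, group in starts_and_groups("helloworld"):
    print(start, group)
```

The match reported at offset 3 starts in the middle of the `ll` run, and that is where the unwanted `low` comes from.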
You can solve the problem like this:
(?=((?:[aeiou]+|[b-dfghj-np-tv-z]+){3}))(?:[aeiou]+|[b-dfghj-np-tv-z]+)
Now the leading vowels (or consonants) are consumed for each match (outside of the lookahead).
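A quick way to try this is the Python sketch below (Python and the variable names are chosen for illustration only). One thing to watch: because `{3}` lets the same alternative match twice in a row, backtracking can split a single trailing consonant run into two pieces, so on `helloworld` the pattern also reports `orld` and `rld` after `world`. The `strict` pattern is a hypothetical variant, not part of the regex above, that spells out VCV and CVC explicitly so that the runs must alternate.

```python
import re

V = "[aeiou]+"
C = "[b-dfghj-np-tv-z]+"

# The regex from this answer: capture three runs in the lookahead, consume one.
consume_one = re.compile(rf"(?=((?:{V}|{C}){{3}}))(?:{V}|{C})")

# Hypothetical stricter variant: only vowel/consonant/vowel or
# consonant/vowel/consonant is accepted inside the lookahead.
strict = re.compile(rf"(?=({V}{C}{V}|{C}{V}{C}))(?:{V}|{C})")

loose_groups = consume_one.findall("helloworld")
strict_groups = strict.findall("helloworld")

print(loose_groups)
print(strict_groups)
```

With the strict variant the trailing `orld` and `rld` disappear, since `rld` can no longer be counted as two separate consonant runs.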
Upvotes: 1