Reputation: 25
For words that do match the rule, I'm using:
\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b
But what about words that DO NOT match the popular rule, such as "science" or even "foreign?"
Upvotes: 1
Views: 210
Reputation: 600
You were almost there! I tweaked your existing version to work without lookbehinds:
\b\w*(?:cie|\bei|(?!c)\wei)\w*\b
The difference is that \bei|(?!c)\wei is looking for a non-c \w followed by "ei", or \b followed by "ei" (to match words like "either"). The lookbehind version finds the same things, but instead by looking for "ei"s not preceded by a c.
You can look at your problem as "words matching the rule 'e before i, except after c'", and then it's obvious that you can take your "i before e" solution and just flip the i's and e's, which is basically exactly what I did.
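To sanity-check the flipped pattern, here is a minimal sketch in Python. The choice of Python and the sample word list are my own assumptions for illustration; the pattern itself is the one given above, and it should behave the same in most regex flavors that support lookaheads.

```python
import re

# The "breaks the rule" pattern from above, unchanged.
BREAKS_RULE = re.compile(r"\b\w*(?:cie|\bei|(?!c)\wei)\w*\b")

# Hypothetical sample words, chosen only for illustration.
words = ["science", "foreign", "either", "receive", "believe", "friend"]

# fullmatch() requires the whole word to be consumed by the pattern.
rule_breakers = [w for w in words if BREAKS_RULE.fullmatch(w)]
print(rule_breakers)  # ['science', 'foreign', 'either']
```

"science" is caught by the cie branch, "foreign" by the non-c \w followed by "ei", and "either" by \bei. The other three contain only "cei" or a non-c "ie", so they are left alone.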
Your "solved problem" (the regex that does fit the rule) doesn't really work for all cases. The word "eighties", for instance, does contain "tie", which is an i before e after a non-c, so your regex does match it. But it also starts with "ei", which is an e before i not after c. So we need stricter rules for what does follow the rule:
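1. The word contains at least one of:
   a. "ie" at the start of the word
   b. "cei"
   c. "ie" after a non-c \w
2. The word contains none of:
   a. "ei" at the start of the word
   b. "cie"
   c. "ei" after a non-c \w
This is actually a pretty fun problem; I suspect there are several ways of solving it, some of which I haven't come up with yet and may be better. Still, here is my solution to "words that follow the rule":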
\b(?!ei)((\w(?!ie|ei))*(cei|((?!c)\w|\b)ie))+(\w(?!ie|ei))*\b
Breaking down the logic behind it:
\b(?!ei) # guarantee 2.a
(
(\w(?!ie|ei))* # consume as many \w not followed by ie or ei as possible
(cei|((?!c)\w|\b)ie) # 1.b or 1.c or 1.a (exclusively: none of 2.)
)+ # guarantee at least 1 match of 1.
(\w(?!ie|ei))*\b # any trailing \w after the last match of 1. can't match 2.
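If you want to try the commented form directly, the breakdown can be pasted into a verbose-mode pattern. The following is a sketch assuming Python's re module (re.VERBOSE ignores whitespace and # comments inside the pattern); the names ORIGINAL and FOLLOWS_RULE and the word list are hypothetical, picked only for this comparison.

```python
import re

# The regex from the question, for comparison.
ORIGINAL = re.compile(r"\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b")

# The stricter regex, written with its comments inside the pattern.
FOLLOWS_RULE = re.compile(
    r"""
    \b(?!ei)                  # guarantee 2.a
    (
      (\w(?!ie|ei))*          # consume \w not followed by ie or ei
      (cei|((?!c)\w|\b)ie)    # 1.b or 1.c or 1.a
    )+                        # at least one match of 1.
    (\w(?!ie|ei))*\b          # trailing \w can't match 2.
    """,
    re.VERBOSE,
)

# Hypothetical sample words, chosen only for illustration.
words = ["believe", "receive", "friend", "eighties", "science", "foreign"]

# Map each word to (matched by ORIGINAL, matched by FOLLOWS_RULE).
results = {
    w: (bool(ORIGINAL.fullmatch(w)), bool(FOLLOWS_RULE.fullmatch(w)))
    for w in words
}
for word, (old, strict) in results.items():
    print(f"{word:10} original={old!s:5} strict={strict}")
```

The only word where the two disagree is "eighties": the question's regex accepts it because of "tie", while the stricter one rejects it because the word starts with "ei".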
Other notes:
(\w(?!ie|ei)) can't start a match for (cei|((?!c)\w|\b)ie).
\w might not be the way to go; consider [a-z].
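To illustrate the \w caveat: \w also matches digits and underscores, so identifier-like tokens can slip through. Below is a small sketch, again assuming Python; LETTERS_ONLY is a hypothetical variant built by swapping every \w for [a-z] (with re.IGNORECASE to cover capitals), and the sample text is made up. Note that \b is left untouched and is still defined in terms of \w.

```python
import re

# The "follows the rule" pattern from above, unchanged.
WORD_CHAR = r"\b(?!ei)((\w(?!ie|ei))*(cei|((?!c)\w|\b)ie))+(\w(?!ie|ei))*\b"

# Hypothetical variant: every \w swapped for [a-z]; \b is left as-is.
LETTERS_ONLY = WORD_CHAR.replace(r"\w", "[a-z]")

# Made-up sample text containing an identifier-like token.
text = "pie_2 and field"

w_hits = [m.group(0) for m in re.finditer(WORD_CHAR, text, re.IGNORECASE)]
az_hits = [m.group(0) for m in re.finditer(LETTERS_ONLY, text, re.IGNORECASE)]

print(w_hits)   # ['pie_2', 'field']
print(az_hits)  # ['field']
```

With \w the token "pie_2" counts as a word that follows the rule; with [a-z] it is skipped, because "_2" can no longer be consumed and there is no word boundary right after "pie".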
Upvotes: 2