jaegernoms

Reputation: 25

What is a regular expression (regex) for matching words that DO NOT follow the "i before e except after c" rule?

For words that do match the rule, I'm using:

\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b

But what about words that DO NOT match the popular rule, such as "science" or even "foreign"?

Upvotes: 1

Views: 210

Answers (1)

KernelPanic

Reputation: 600

Answering the question you asked

You were almost there! I adapted your existing regex, and it needs no lookbehinds:

 \b\w*(?:cie|\bei|(?!c)\wei)\w*\b

The difference is that \bei|(?!c)\wei looks for a non-c \w followed by "ei", or a \b followed by "ei" (to match words like "either"). A lookbehind version would find the same things, but by looking for "ei"s not preceded by a c.

You can look at your problem as "words matching the rule 'e before i, except after c'", and then it's obvious that you can take your "i before e" solution and just flip the i's and e's, which is basically exactly what I did.
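
A quick way to sanity-check the flipped regex is to run it over a few words. This is only a sketch: the helper name breaks_rule and the word list are my own picks for illustration, and it assumes single lowercase words.

```python
import re

# The flipped regex from above: words that break "i before e except after c".
BREAKS_RULE = re.compile(r"\b\w*(?:cie|\bei|(?!c)\wei)\w*\b")


def breaks_rule(word: str) -> bool:
    """Return True if the whole (lowercase) word matches the rule-breaking regex."""
    return BREAKS_RULE.fullmatch(word) is not None


# Sample words chosen for illustration only.
for word in ["science", "foreign", "either", "believe", "receive"]:
    print(word, breaks_rule(word))
# science, foreign, and either print True; believe and receive print False
```

fullmatch is used here because each input is a single word; for running text, finditer with the same pattern would pull out the offending words instead.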

Answering the question you didn't ask, but which is more interesting

Your "solved problem" (the regex that does fit the rule) doesn't really work for all cases. The word "eighties", for instance, does contain "tie", which is an i before e after a non-c, so your regex does match it. But it also starts with "ei", which is an e before i not after c. So we need stricter conditions; a word follows the rule only if both 1 and 2 below hold:

  1. a) Starts with "ie" OR b) Contains "cei" OR c) Contains an "ie" that follows a non-c \w.
  2. a) Does not start with "ei" AND b) Does not contain "cie" AND c) Does not contain an "ei" that follows a non-c \w.
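
Before reaching for a regex, the two conditions can be written out as plain Python, which is handy as a reference to test any regex against. This is a sketch under my own assumptions: the name follows_rule is made up, and the input is taken to be a single lowercase word made only of \w characters.

```python
def follows_rule(word: str) -> bool:
    """Strict check of conditions 1 and 2 for a single lowercase word."""

    def pair_after_non_c(pair: str) -> bool:
        # True if `pair` occurs anywhere preceded by a character other than "c".
        return any(
            word[i:i + 2] == pair and word[i - 1] != "c"
            for i in range(1, len(word) - 1)
        )

    # Condition 1: 1.a or 1.b or 1.c
    satisfies_1 = word.startswith("ie") or "cei" in word or pair_after_non_c("ie")
    # Negation of condition 2: 2.a, 2.b, or 2.c is violated
    violates_2 = word.startswith("ei") or "cie" in word or pair_after_non_c("ei")
    return satisfies_1 and not violates_2


print(follows_rule("believe"))   # True
print(follows_rule("eighties"))  # False: contains "tie" but starts with "ei"
```

Note that a word with no "ie" or "ei" at all returns False here, since it never satisfies condition 1; the rule simply doesn't apply to it.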

This is actually a pretty fun problem; I suspect there are several ways of solving it, some of which I haven't come up with yet and may be better. Still, here is my solution to "words that follow the rule":

\b(?!ei)((\w(?!ie|ei))*(cei|((?!c)\w|\b)ie))+(\w(?!ie|ei))*\b

Breaking down the logic behind it:

\b(?!ei)                 # guarantee 2.a
(
  (\w(?!ie|ei))*         # consume as many \w not followed by ie or ei as possible
  (cei|((?!c)\w|\b)ie)   # 1.b or 1.c or 1.a (exclusively: none of 2.)
)+                       # guarantee at least 1 match of 1.
(\w(?!ie|ei))*\b         # any trailing \w after the last match of 1. can't match 2.
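
The commented breakdown can be compiled more or less as-is with Python's re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. A minimal sketch; the name follows_rule_regex and the sample words are mine, and it assumes single lowercase words.

```python
import re

FOLLOWS_RULE = re.compile(
    r"""
    \b(?!ei)                 # guarantee 2.a
    (
      (\w(?!ie|ei))*         # filler: any \w not followed by ie or ei
      (cei|((?!c)\w|\b)ie)   # 1.b, 1.c, or 1.a
    )+                       # at least one match of condition 1
    (\w(?!ie|ei))*\b         # trailing filler that cannot match condition 2
    """,
    re.VERBOSE,
)


def follows_rule_regex(word: str) -> bool:
    """Return True if the whole (lowercase) word matches the stricter regex."""
    return FOLLOWS_RULE.fullmatch(word) is not None


# Sample words chosen for illustration only.
for word in ["believe", "receive", "friend", "science", "foreign", "eighties"]:
    print(word, follows_rule_regex(word))
# believe, receive, and friend print True; the other three print False
```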

Other notes:

  1. Yup, that's nesting repetition. Should not cause catastrophic backtracking since both parts are mutually exclusive: (\w(?!ie|ei)) can't start a match for (cei|((?!c)\w|\b)ie).
  2. I'm making an assumption that the sequence "ieei" won't appear; my solution matches it when it strictly shouldn't, but this sequence doesn't appear in my computer's dictionary so I'll consider it an edge case to chew on later.
  3. As with your example, this will only work for all-lowercase c, e, and i. If you're only looking at lowercase strings, \w might not be the way to go; consider [a-z].
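
Notes 2 and 3 are easy to poke at. The sketch below swaps \w for [a-z] and, as my own assumption rather than anything from the question, adds re.IGNORECASE so capitalised words are handled too. The name follows_rule_any_case is made up, and "xieeiy" is not a real word, just a string containing "ieei".

```python
import re

# Assumption: [a-z] plus re.IGNORECASE stands in for \w from the regex above.
FOLLOWS_RULE_ANY_CASE = re.compile(
    r"\b(?!ei)(([a-z](?!ie|ei))*(cei|((?!c)[a-z]|\b)ie))+([a-z](?!ie|ei))*\b",
    re.IGNORECASE,
)


def follows_rule_any_case(word: str) -> bool:
    """Return True if the whole word matches, regardless of letter case."""
    return FOLLOWS_RULE_ANY_CASE.fullmatch(word) is not None


print(follows_rule_any_case("Ceiling"))  # True: "Cei" now counts as "cei"
print(follows_rule_any_case("Science"))  # False
print(follows_rule_any_case("xieeiy"))   # True: the "ieei" edge case from note 2
```

With the case-sensitive \w version, "Ceiling" would not match, because the capital C is not recognised as the c in "cei". Lowercasing the input before matching would be another way to get the same effect.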

Upvotes: 2
