Danwen Huang

Reputation: 81

Regex - match words that follow the rule "i before e except after c"

Hi, I'm wondering how to match words that follow the spelling rule “i before e except after c” (such as brief, receipt, receive, pier). The pattern shouldn't match words that don't follow that rule, such as science.

What I have so far is incorrect, since science shouldn't match but does:

current progress

I don't really know how to do this without using lookbehind (which I know isn't very well supported).

Upvotes: 3

Views: 1159

Answers (4)

Clarius

Reputation: 1419

I was analysing the text of a novel, looking for the most common words that obeyed and broke the i before e rule. It would appear you're more likely to use a word that breaks the rule than one that obeys it.

import operator, re

ie  = {}
cei = {}
ei  = {}

# the with-block closes the file once the loop has finished
with open('1342.txt', 'r') as file:

    for line in file:

        line = line.lower()

        # ditch anything that's not a word or a white-space character
        line = re.sub(r'[^\w\s]', '', line)

        # split line into words
        for word in line.split():

            # add the word to the appropriate dictionary or increment
            # frequency of occurrence if already in that dictionary

            # any word containing "ie"
            if 'ie' in word:
                ie[word] = ie.get(word, 0) + 1

            # "ei" directly after c
            if 'cei' in word:
                cei[word] = cei.get(word, 0) + 1

            # "ei" directly after a letter other than c
            # (the original class [a-b,d-z] also matched a literal comma)
            if re.search(r'[abd-z]ei', word):
                ei[word] = ei.get(word, 0) + 1

# sort each dictionary by frequency, highest first, and display the
# 10 most common words from each (the original loop stopped after 9)

for counts in (ie, cei, ei):
    x = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    for entry in x[:10]:
        print(entry)
    print()

Upvotes: 0

Mustofa Rizwan

Reputation: 10466

You can try this:

\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b
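
As a quick check, here is a minimal Python sketch that runs this pattern over the words from the question. The pattern only relies on word boundaries and lookaheads, not lookbehind. The follows_rule helper and the word list are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r'\b\w*(cei|\bie|(?!c)\w(?=ie))\w*\b')

def follows_rule(word):
    """Hypothetical helper: True if the whole word matches the pattern."""
    return PATTERN.fullmatch(word) is not None

# Sample words taken from the question.
for word in ['brief', 'receipt', 'receive', 'pier', 'science']:
    print(word, follows_rule(word))
```

With these inputs the first four words print True and science prints False, because the only ie in science is preceded by c and the (?!c) lookahead rejects it.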


Upvotes: 2

BlueMonkMN

Reputation: 25601

This seems simple enough:

[A-Za-z]*(cei|[^c]ie)[A-Za-z]*
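
A minimal Python sketch of how this pattern behaves on the words from the question. The find_words helper and the sample text are assumptions made for illustration.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r'[A-Za-z]*(cei|[^c]ie)[A-Za-z]*')

def find_words(text):
    """Hypothetical helper: every substring of text the pattern matches."""
    return [m.group(0) for m in PATTERN.finditer(text)]

# Sample words taken from the question.
print(find_words('brief receipt receive pier science'))
```

On this input the result is brief, receipt, receive and pier, with science left out. One thing to keep in mind is that [^c] matches any character other than a lowercase c, not only letters, so for a word that starts with ie the match can include the space in front of it.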

Upvotes: 4

Wiktor Stribiżew

Reputation: 626927

I suggest using a regex with a negative lookahead:

/\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*/i

See the regex demo

Details:

  • \b - a leading word boundary
  • (?![a-z]*cie) - a negative lookahead that fails the match if the word has cie after 0+ letters
  • [a-z]* - 0+ letters
  • (?:cei|ie) - cei or ie
  • [a-z]* - 0+ letters.
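
The same pattern can be tried in Python, where the /.../i literal becomes re.compile(..., re.IGNORECASE). This is only a sketch: the matching_words helper and the sample sentence are assumptions made for illustration.

```python
import re

# /\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*/i written for Python's re module.
PATTERN = re.compile(r'\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*', re.IGNORECASE)

def matching_words(text):
    """Hypothetical helper: all words in text that the pattern matches."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return PATTERN.findall(text)

print(matching_words('Brief receipt, receive the pier; science is ancient.'))
```

On this sample the call returns Brief, receipt, receive and pier. The words science and ancient are rejected because the negative lookahead finds cie in them. Note that the lookahead rejects a word containing cie anywhere, even if the word also has an ie that is not after c.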

Upvotes: 0
