Reputation: 81
Hi, I'm wondering how to match words that follow the spelling rule “i before e except after c” (such as brief, receipt, receive, pier), without matching words that break the rule, such as science.
What I have here is incorrect (as science shouldn't match), but it's what I've got so far:
I don't really know how to do this without using a lookbehind (which I know isn't very well supported).
Upvotes: 3
Views: 1159
Reputation: 1419
I was analysing the text of a novel, looking for the most common words that obeyed and broke the i before e rule. It would appear you're more likely to use a word that breaks the rule than one that obeys it.
import operator, re

ie = {}
cei = {}
ei = {}

with open('1342.txt', 'r') as file:
    for line in file:
        line = line.lower()
        # ditch anything that's not a word or a white-space character
        line = re.sub(r'[^\w\s]', '', line)
        # split line into words
        for word in line.split():
            # add the word to the appropriate dictionary or increment
            # frequency of occurrence if already in that dictionary
            if re.search(r'ie', word):
                ie[word] = ie.get(word, 0) + 1
            if re.search(r'cei', word):
                cei[word] = cei.get(word, 0) + 1
            # 'ei' preceded by any letter other than 'c'
            if re.search(r'[abd-z]ei', word):
                ei[word] = ei.get(word, 0) + 1

# sort each dictionary and display the 10 most common words from each
for counts in (ie, cei, ei):
    x = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
    for entry in x[:10]:
        print(entry)
    print()
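The tally could also be sketched with collections.Counter, which takes care of the "increment or create" step and the top-10 selection. This is only an illustration: the tally function and the sample sentence are made up for the example, it works on a string rather than on 1342.txt, and it tokenizes by pulling out runs of letters, which is an assumption that differs slightly from stripping punctuation and splitting on white-space.

```python
import re
from collections import Counter

def tally(text):
    """Count words containing 'ie', 'cei', and 'ei' preceded by a non-c letter."""
    # Hypothetical helper: same three substring tests, one Counter per category.
    patterns = {
        'ie': re.compile(r'ie'),
        'cei': re.compile(r'cei'),
        'ei': re.compile(r'[abd-z]ei'),
    }
    counts = {name: Counter() for name in patterns}
    # Assumption: a "word" is a run of lowercase letters.
    for word in re.findall(r'[a-z]+', text.lower()):
        for name, pattern in patterns.items():
            if pattern.search(word):
                counts[name][word] += 1
    return counts

# Made-up sample text, not taken from the novel.
sample = "Their friend received a receipt; their neighbour believed the friend."
counts = tally(sample)
for name in ('ie', 'cei', 'ei'):
    # most_common(10) returns up to ten (word, count) pairs, highest first
    print(name, counts[name].most_common(10))
```

A word can land in more than one category, since the three tests are independent of each other.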
Upvotes: 0
Reputation: 626927
I suggest using a regex with a negative lookahead:
/\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*/i
See the regex demo
Details:
\b - a leading word boundary
(?![a-z]*cie) - a negative lookahead that fails the match if the word has cie after 0+ letters
[a-z]* - 0+ letters
(?:cei|ie) - cei or ie
[a-z]* - 0+ letters.
Upvotes: 0
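As a quick check, the pattern above can be tried against the words from the question. This is a minimal sketch in Python rather than the JavaScript-style literal shown: the /i flag becomes re.IGNORECASE, and the names RULE and follows_rule are made up for the example.

```python
import re

# Same pattern as above; re.IGNORECASE stands in for the /i flag.
RULE = re.compile(r'\b(?![a-z]*cie)[a-z]*(?:cei|ie)[a-z]*', re.IGNORECASE)

def follows_rule(word):
    """Return True if the whole word matches the pattern (hypothetical helper)."""
    return RULE.fullmatch(word) is not None

# Words taken from the question.
for word in ['brief', 'receipt', 'receive', 'pier', 'science']:
    print(word, follows_rule(word))

# The pattern can also pull matching words out of running text.
print(RULE.findall('I believe science'))
```

Note that the pattern only accepts words containing ie or cei and rejects those containing cie, so a word with ei after another letter (their, for instance) is simply not matched.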