Reputation: 311
I have a program that displays a frequency list of the words in a tokenized text. I also want to do two more things: first, detect the proper nouns in the text and append them to another list (Cap_nouns); second, append the words that are not in a dictionary to another list (errors).
Later on, I want to create a frequency list for the errors found and another frequency list for the proper nouns found.
My idea for detecting the proper nouns was to find the items that start with a capital letter and append them to this list, but it seems that my regular expression for this task does not work.
Can anyone help me with that? My code is below.
from collections import defaultdict
import re
import nltk
from nltk.tokenize import word_tokenize
with open('fr-text.txt') as f:
    freq = word_tokenize(f.read())
with open('Fr-dictionary_Upper_Low.txt') as fr:
    dic = word_tokenize(fr.read())
#regular expression to detect words with apostrophes and separated by hyphens
pat=re.compile(r".,:;?!-'%|\b(\w'|w’)+\b|\w+(?:-\w+)+|\d+")
reg= list(filter(pat.match, freq))
#regular expression for words that start with a capital letter
patt=re.compile(r"\b^A-Z\b")
c_n= list(filter(patt.match, freq))
d=defaultdict(int)
#Empty list to append the items not found in the dictionary
errors=[ ]
Cnouns=[ ] #Empty list to append the items starting with a capital letter
for w in freq:
    d[w]+=1
    if w in reg:
        continue
    elif w in c_n:
        Cnouns.append(w)
    elif w not in dic:
        errors.append(w)
for w in sorted(d, key=d.get):
    print(w, d[w])
print(errors)
print(Cnouns)
If there is anything else wrong with my code, let me know, please.
Upvotes: 1
Views: 1577
Reputation: 627082
As for the regex part, your patterns are "a bit off". Mostly, you are missing the notion of a character class: [abc]-like patterns that match a single char from the set defined inside the brackets.
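To make the character class point concrete, here is a minimal sketch comparing the pattern from the question with a bracketed version. The token list is made up for illustration and is not taken from the asker's file.

```python
import re

# Hypothetical sample tokens, chosen only for this illustration.
tokens = ["Paris", "maison", "A-Z", "Zola"]

# Pattern from the question: outside brackets, "^" is a start-of-string
# anchor and "A-Z" is the literal three-character text "A-Z".
literal = re.compile(r"\b^A-Z\b")

# With brackets, [A-Z] is a character class: one uppercase ASCII letter.
char_class = re.compile(r"[A-Z]")

literal_hits = [t for t in tokens if literal.match(t)]
class_hits = [t for t in tokens if char_class.match(t)]

print(literal_hits)  # only the token that is literally "A-Z"
print(class_hits)    # every token whose first char is an ASCII capital
```

So the unbracketed pattern only accepts a token spelled exactly "A-Z", which is probably why the original c_n list came out empty.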
Regular expression to detect words with apostrophes and separated by hyphens:
pat=re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)*")
See the regex demo. However, it will also match regular numbers, or simple words. To avoid matching them, you may use
pat=re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)+|\w+['’]\w+")
See this regex demo.
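As a rough check of the two patterns above, the sketch below runs both against a handful of hand-written tokens using fullmatch, so that a token only counts when the whole token fits. The tokens are assumptions for illustration; what word_tokenize actually produces for your file may differ.

```python
import re

# The two patterns from this answer.
pat_all = re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)*")
pat_strict = re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)+|\w+['’]\w+")

# Hypothetical tokens, not taken from the asker's data.
tokens = ["aujourd'hui", "arc-en-ciel", "maison", "2019", "l’ami"]

# fullmatch requires the entire token to match the pattern.
all_hits = [t for t in tokens if pat_all.fullmatch(t)]
strict_hits = [t for t in tokens if pat_strict.fullmatch(t)]

print(all_hits)     # includes plain words and numbers
print(strict_hits)  # only tokens containing an apostrophe or a hyphen
```

The first pattern accepts every token in the sample, while the second keeps only the apostrophe and hyphen forms.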
Details
(?:\w+['’])? - an optional non-capturing group matching 1 or 0 occurrences of 1+ word chars followed by either ' or ’
\w+ - 1 or more word chars
(?:-(?:\w+['’])?\w+)* - 0 or more repetitions of
    -(?:\w+['’])? - a hyphen followed by an optional non-capturing group matching 1 or 0 occurrences of 1+ word chars followed by either ' or ’
    \w+ - 1 or more word chars
Next, reg = list(filter(pat.match, freq)) might not do what you need, as re.match only matches at the start of the string. You most probably want to use re.search:
reg = list(filter(pat.search, freq))
                      ^^^^^^
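The difference between the two methods shows up as soon as a token starts with a character the pattern does not accept. A small sketch, using a made-up token and a simplified hyphenated-word pattern:

```python
import re

# Simplified pattern for a hyphenated word (illustration only).
hyphenated = re.compile(r"\w+-\w+")

# Hypothetical token with a leading parenthesis.
token = "(porte-monnaie)"

at_start = hyphenated.match(token)    # anchored at index 0, where "(" is
anywhere = hyphenated.search(token)   # scans the string for a first match

print(at_start)          # None, because "(" is not a word char
print(anywhere.group())  # the hyphenated word inside the parentheses
```

If you need the whole token to fit the pattern rather than just part of it, fullmatch is the third option worth considering.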
A regular expression for words that start with a capital letter can be written like
patt=re.compile(r"\b[A-Z][a-z]*\b")
c_n= list(filter(patt.search, freq))
See this regex demo
The \b matches a word boundary, the [A-Z] matches any uppercase ASCII letter, the [a-z]* part matches 0 or more lowercase ASCII letters, and the final \b makes sure there is a word boundary after them.
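One caveat, assuming the text is French as the file names suggest: [A-Z] and [a-z] cover ASCII letters only, so a word starting with an accented capital is skipped, and so is an all-caps token. The sketch below compares the pattern with a str.isupper check on the first character. The tokens are invented for illustration, and the isupper variant is a possible alternative, not part of the pattern above.

```python
import re

patt = re.compile(r"\b[A-Z][a-z]*\b")

# Hypothetical tokens, not taken from the asker's file.
tokens = ["Paris", "ville", "École", "USA", "Marie"]

# ASCII-only pattern from this answer.
ascii_hits = [t for t in tokens if patt.search(t)]

# Alternative (an assumption, not the original approach): test whether
# the first character is uppercase, which is Unicode-aware in Python 3.
first_upper_hits = [t for t in tokens if t[:1].isupper()]

print(ascii_hits)
print(first_upper_hits)
```

Either way, a capitalized first letter is only a heuristic for proper nouns, since the first word of every sentence is capitalized too.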
Upvotes: 1