Natália Resende

Reputation: 311

Regex to detect proper nouns in a list

I have a program that displays a frequency list of the words in a tokenized text, but I also want to: first, detect the proper nouns in the text and append them to another list (Cap_nouns); second, append the nouns that are not in a dictionary to another list (errors).

Later on, I want to create a frequency list for these errors found and another frequency list for the proper nouns found.

My idea to detect the proper nouns was to find the items that start with a capital letter and append them in this list, but it seems that my regular expression for this task does not work.

Can anyone help me with that? My code is below.

from collections import defaultdict
import re
import nltk
from nltk.tokenize import word_tokenize



with open('fr-text.txt') as f:
    freq = word_tokenize(f.read())

with open ('Fr-dictionary_Upper_Low.txt') as fr:
    dic = word_tokenize(fr.read())


#regular expression to detect words with apostrophes and separated by hyphens    
pat=re.compile(r".,:;?!-'%|\b(\w'|w’)+\b|\w+(?:-\w+)+|\d+") 
reg= list(filter(pat.match, freq))
#regular expression for words that start with a capital letter
patt=re.compile(r"\b^A-Z\b")  
c_n= list(filter(patt.match, freq))

d=defaultdict(int)

#Empty list to append the items not found in the dictionary
errors=[ ]
Cnouns=[ ] #Empty list to append the items starting with a capital letter


for w in freq:
    d[w]+=1
    if w in reg:
        continue
    elif w in c_n:
        Cnouns.append(w)
    elif w not in dic:
        errors.append(w)



for w in sorted(d, key=d.get):
    print(w, d[w])


print(errors)
print(Cnouns)

If there is anything else wrong with my code, let me know, please.

Upvotes: 1

Views: 1577

Answers (1)

Wiktor Stribiżew

Reputation: 627082

As for the regex part, your patterns are "a bit off". Mostly, you are missing the notion of a character class: patterns like [abc] that match a single char from the set defined in the class.
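To make the character class point concrete, here is a minimal sketch comparing the pattern from the question with a bracketed version. The token list is made up for illustration and is not taken from the asker's files.

```python
import re

# Without brackets, ^A-Z is not a range: ^ anchors at the start of the string
# and "A-Z" is read as the three literal characters A, -, Z.
broken = re.compile(r"\b^A-Z\b")
# With brackets, [A-Z] is a character class: any ONE uppercase ASCII letter.
fixed = re.compile(r"[A-Z]")

tokens = ["Paris", "ville", "A-Z"]  # hypothetical sample tokens

broken_hits = [t for t in tokens if broken.match(t)]
fixed_hits = [t for t in tokens if fixed.match(t)]

print(broken_hits)  # ['A-Z']
print(fixed_hits)   # ['Paris', 'A-Z']
```

The original pattern only accepts the literal token "A-Z", which is why the list of capitalized words came back empty for ordinary text.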

Regular expression to detect words with apostrophes and separated by hyphens:

pat=re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)*") 

See the regex demo. However, it will also match regular numbers, or simple words. To avoid matching them, you may use

pat=re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)+|\w+['’]\w+")

See this regex demo.

Details

  • (?:\w+['’])? - an optional non-capturing group matching 1 or 0 occurrences of 1+ word chars followed by either ' or ’
  • \w+ - 1 or more word chars
  • (?:-(?:\w+['’])?\w+)* - 0 or more repetitions of
    • - - a hyphen
    • (?:\w+['’])? - an optional non-capturing group matching 1 or 0 occurrences of 1+ word chars followed by either ' or ’
    • \w+ - 1 or more word chars
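As a rough check of the two patterns above, the sketch below runs them over a few sample tokens. The tokens are invented for illustration, and fullmatch is used here (my choice, not part of the original code) so that a token only counts when the pattern covers it entirely.

```python
import re

# First pattern: also accepts plain words and numbers.
pat_all = re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)*")
# Second pattern: requires at least one hyphen or one apostrophe.
pat_strict = re.compile(r"(?:\w+['’])?\w+(?:-(?:\w+['’])?\w+)+|\w+['’]\w+")

# Hypothetical sample tokens, not taken from the asker's text.
tokens = ["l'homme", "arc-en-ciel", "aujourd’hui", "maison", "2023", "."]

loose = [t for t in tokens if pat_all.fullmatch(t)]
strict = [t for t in tokens if pat_strict.fullmatch(t)]

print(loose)   # ["l'homme", 'arc-en-ciel', 'aujourd’hui', 'maison', '2023']
print(strict)  # ["l'homme", 'arc-en-ciel', 'aujourd’hui']
```

Only the second pattern leaves out "maison" and "2023", which matches the remark that the first one also matches regular numbers and simple words.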

Next, reg = list(filter(pat.match, freq)) might not do what you need, as re.match only matches at the start of the string. You most probably want to use re.search:

reg = list(filter(pat.search, freq))
                      ^^^^^^
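The difference between the two methods can be seen with a small sketch. The pattern is a simplified apostrophe-only one and the tokens are hypothetical; the second token starts with a punctuation character to show where match gives up.

```python
import re

# Simplified pattern for illustration: word chars, an apostrophe, word chars.
pat = re.compile(r"\w+['’]\w+")

# Hypothetical tokens; the second one begins with a non-word character.
tokens = ["l'homme", "(l'homme", "maison"]

# match() only tries at position 0 of each token.
with_match = list(filter(pat.match, tokens))
# search() scans the token for the first position where the pattern fits.
with_search = list(filter(pat.search, tokens))

print(with_match)   # ["l'homme"]
print(with_search)  # ["l'homme", "(l'homme"]
```

If a token must match from its first to its last character, pat.fullmatch is a third option worth considering.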

A regular expression for words that start with a capital letter can be written like

patt=re.compile(r"\b[A-Z][a-z]*\b")  
c_n= list(filter(patt.search, freq))

See this regex demo.

The \b matches a word boundary, the [A-Z] matches any uppercase ASCII letter, the [a-z]* part matches 0 or more lowercase ASCII letters and \b makes sure there is a word boundary after them.
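Below is a minimal sketch of how the capitalized-word filter could feed the frequency list the question asks for. The token list is invented, and using collections.Counter is my suggestion rather than something in the original code. Since the input file name suggests French text, the sketch also shows that [A-Z] skips accented capitals such as É; the str.isupper check on the first character is one possible alternative, offered as an assumption about what is wanted.

```python
import re
from collections import Counter

patt = re.compile(r"\b[A-Z][a-z]*\b")

# Hypothetical tokens standing in for word_tokenize output.
tokens = ["Paris", "est", "belle", ".", "Paris", "Lyon", "Émile"]

# ASCII-only: "Émile" is missed because É is outside A-Z.
cap_nouns = [t for t in tokens if patt.search(t)]
cap_freq = Counter(cap_nouns)

# Unicode-aware alternative: test whether the first character is uppercase.
unicode_caps = [t for t in tokens if t[:1].isupper()]

print(cap_freq.most_common())  # [('Paris', 2), ('Lyon', 1)]
print(unicode_caps)            # ['Paris', 'Paris', 'Lyon', 'Émile']
```

Note that either check also catches ordinary words capitalized at the start of a sentence, so the result is a list of candidates rather than confirmed proper nouns.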

Upvotes: 1
