Named Entity Recognition Python

Question

What I want to do: Extract all the occurrences of n consecutive words that all begin with a capital letter.

Input: ("Does John Doe eat pizza in New York?", 2)
Output: [("Does", "John"), ("John", "Doe"), ("New", "York")]

Input: ("Does John Doe eat pizza in New York?", 3)
Output: [("Does", "John","Doe")]

Here is what I have come up with so far:

# create text file
fw = open("ngram.txt", "w")
fw.write ("Does John Doe eat pizza in New York?")
fw.close()

def UpperCaseNGrams (file,n):
    fr = open (file, "r")
    text = fr.read().split()

    ngramlist = [text[word:word+n] for word in range(len(text)-(n-1)) if word[0].isupper() if word+n[0].isupper()]  
    return ngramlist

print (UpperCaseNGrams("ngram.txt",2))

I get the following error:
TypeError: 'int' object is not subscriptable

What do I have to change in order for it to work?

mhawke · Accepted Answer

In word+n[0].isupper(), both word and n are of type int and therefore cannot be indexed using []; integers are not subscriptable.

I think that your intention is to check that the nth word past the current one starts with a capital; however, that would be done with text[word+n][0]. Regardless, I don't think your method will work for values of n other than 2. For example, if n were 3, you would need to check that every word between the current one and the nth word past it is also capitalised.
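
To illustrate why checking only the two ends of a window is not enough once n is larger than 2, here is a small sketch. The helper name endpoints_only and the sample sentence are made up for this illustration; note that the last word of an n-word window starting at index i sits at index i + n - 1.

```python
def endpoints_only(words, n):
    # Hypothetical helper: tests only the first and last word of each window.
    # The last word of the window words[i:i+n] is words[i + n - 1].
    return [tuple(words[i:i + n]) for i in range(len(words) - (n - 1))
            if words[i][0].isupper() and words[i + n - 1][0].isupper()]

words = "John eats Pizza daily".split()
print(endpoints_only(words, 3))
# [('John', 'eats', 'Pizza')]  <- wrongly accepted: "eats" is lowercase
```

The window is accepted even though its middle word is not capitalised, which is why every word in the window has to be checked.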

The easiest fix is to use all() to check that every word in each sublist begins with a capital:

ngramlist = [text[word:word+n] for word in range(len(text)-(n-1))
                 if all(s[0].isupper() for s in text[word:word+n])]
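
For reference, the same comprehension can be wrapped in a self-contained function. This is only a sketch: the name upper_case_ngrams_all is invented here, and it takes the text as a string rather than reading it from a file as the question does.

```python
def upper_case_ngrams_all(text, n):
    # str.split() never yields empty strings, so w[0] is always safe here.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - (n - 1))
            if all(w[0].isupper() for w in words[i:i + n])]

print(upper_case_ngrams_all("Does John Doe eat pizza in New York?", 2))
# [('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
```

If n is larger than the number of words, the range is empty and the function returns an empty list.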

If you want something a bit faster, you could group runs of capitalised words together like this:

from itertools import groupby

text = 'Does John Doe eat pizza in New York?'.split()
caps_words = [list(v) for g,v in groupby(text, key=lambda x: x[0].isupper()) if g]
print(caps_words)

This would output

[['Does', 'John', 'Doe'], ['New', 'York?']]

Now you need to extract sublists of length n from each run:

ngrams = []
n = 2
for run in caps_words:
    ngrams.extend(run[i:i+n] for i in range(len(run)-(n-1)))

which results in ngrams:

[['Does', 'John'], ['John', 'Doe'], ['New', 'York?']]

and for n = 3:

[['Does', 'John', 'Doe']]

Putting that all together (and turning the ngram accumulator into a list comprehension) results in a function like this:

from itertools import groupby

def upper_case_ngrams(words, n):
    caps_words = [list(v) for g,v in groupby(words, key=lambda x: x[0].isupper()) if g]
    return [tuple(run[i:i+n]) for run in caps_words
                for i in range(len(run)-(n-1))]

text = 'Does John Doe eat pizza in New York?'.split()
for n in range(1, 5):
    print(upper_case_ngrams(text, n))

Output

[('Does',), ('John',), ('Doe',), ('New',), ('York?',)]
[('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
[('Does', 'John', 'Doe')]
[]
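
Note that the output keeps the trailing punctuation ('York?'), whereas the question's expected output has "York". One possible way to handle that, assuming that stripping leading and trailing punctuation from each token is acceptable, is sketched below. The name upper_case_ngrams_clean is hypothetical, and the choice of string.punctuation is an assumption rather than part of the original answer.

```python
import string
from itertools import groupby

def upper_case_ngrams_clean(text, n):
    # Assumption: punctuation at either end of a token is not part of the word.
    words = [w.strip(string.punctuation) for w in text.split()]
    # w[:1] is '' for tokens that were pure punctuation; ''.isupper() is False,
    # so such tokens simply end the current run instead of raising IndexError.
    runs = [list(v) for g, v in groupby(words, key=lambda w: w[:1].isupper()) if g]
    return [tuple(run[i:i + n]) for run in runs
            for i in range(len(run) - (n - 1))]

print(upper_case_ngrams_clean("Does John Doe eat pizza in New York?", 2))
# [('Does', 'John'), ('John', 'Doe'), ('New', 'York')]
```

One trade-off of this approach is that capitalised words separated only by attached punctuation, such as a comma, end up in the same run.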
