Reputation: 15
What I want to do: Extract all the occurrences of n consecutive words that all begin with a capital letter.
Input: ("Does John Doe eat pizza in New York?", 2)
Output: [("Does", "John"), ("John", "Doe"), ("New", "York")]
Input: ("Does John Doe eat pizza in New York?", 3)
Output: [("Does", "John","Doe")]
Here is what I have come up with so far:
# create text file
fw = open("ngram.txt", "w")
fw.write ("Does John Doe eat pizza in New York?")
fw.close()
def UpperCaseNGrams (file,n):
    fr = open (file, "r")
    text = fr.read().split()
    ngramlist = [text[word:word+n] for word in range(len(text)-(n-1)) if word[0].isupper() if word+n[0].isupper()]
    return ngramlist
print (UpperCaseNGrams("ngram.txt",2))
I get the following error:
TypeError: 'int' object is not subscriptable
What do I have to change in order for it to work?
Upvotes: 0
Views: 243
Reputation: 87054
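In word+n[0].isupper(), both word and n are of type int and therefore cannot be indexed using [], i.e. integers are not subscriptable. The same applies to word[0] in the first condition, since word comes from range() and is an index, not a string.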
I think your intention is to check that the last word of the n-word window starts with a capital; that would be done with text[word+n-1][0], since the slice text[word:word+n] ends at index word+n-1. Regardless, I don't think your method will work for values of n other than 2. If n were 3, for example, you would also need to check that every word between the first and the last word of the window is capitalised.
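To illustrate why checking only the two ends of the window is not enough, here is a minimal sketch. The function name endpoint_only_ngrams and the sample sentence are made up for this illustration and are not part of the original code.

```python
def endpoint_only_ngrams(words, n):
    """Hypothetical check that inspects only the first and last word of each window."""
    return [
        tuple(words[i:i + n])
        for i in range(len(words) - n + 1)
        # Only the two ends are tested; anything in between is ignored.
        if words[i][0].isupper() and words[i + n - 1][0].isupper()
    ]


words = "John eats Pizza daily".split()
result = endpoint_only_ngrams(words, 3)
print(result)  # [('John', 'eats', 'Pizza')]
```

The lowercase word "eats" slips through because it sits between two capitalised words, so for n greater than 2 every word in the window has to be tested.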
The easiest fix is to use all() to check that every word in each sublist begins with a capital:
ngramlist = [text[word:word+n] for word in range(len(text)-(n-1))
             if all(s[0].isupper() for s in text[word:word+n])]
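As a self-contained sketch of that fix, the comprehension can be wrapped in a function that takes an already split list of words. The name upper_case_ngrams_all is hypothetical, and it returns tuples (as in the output requested in the question) rather than the lists produced by the one-liner above.

```python
def upper_case_ngrams_all(words, n):
    """Return every window of n consecutive words in which each word starts with a capital."""
    return [
        tuple(words[i:i + n])
        for i in range(len(words) - n + 1)
        # all() tests every word in the window, not just its two ends.
        if all(w[0].isupper() for w in words[i:i + n])
    ]


text = "Does John Doe eat pizza in New York?".split()
bigrams = upper_case_ngrams_all(text, 2)
trigrams = upper_case_ngrams_all(text, 3)
print(bigrams)   # [('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
print(trigrams)  # [('Does', 'John', 'Doe')]
```

Note that the trailing "?" stays attached to "York?" because str.split() only splits on whitespace.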
If you want something a bit faster you could do something like this to group runs of capitalised words together:
from itertools import groupby
text = 'Does John Doe eat pizza in New York?'.split()
caps_words = [list(v) for g,v in groupby(text, key=lambda x: x[0].isupper()) if g]
print(caps_words)
This would output
[['Does', 'John', 'Doe'], ['New', 'York?']]
Now you need to extract sublists of length n from each run:
ngrams = []
n = 2
for run in caps_words:
    ngrams.extend(run[i:i+n] for i in range(len(run)-(n-1)))
which results in ngrams:
[['Does', 'John'], ['John', 'Doe'], ['New', 'York?']]
and for n = 3:
[['Does', 'John', 'Doe']]
Putting that all together (and turning the ngram accumulator into a list comprehension) results in a function like this:
from itertools import groupby

def upper_case_ngrams(words, n):
    # Group consecutive capitalised words into runs, discarding the other groups.
    caps_words = [list(v) for g,v in groupby(words, key=lambda x: x[0].isupper()) if g]
    # Slide a window of length n over each run.
    return [tuple(run[i:i+n]) for run in caps_words
                for i in range(len(run)-(n-1))]

text = 'Does John Doe eat pizza in New York?'.split()
for n in range(1, 5):
    print(upper_case_ngrams(text, n))
Output:
[('Does',), ('John',), ('Doe',), ('New',), ('York?',)]
[('Does', 'John'), ('John', 'Doe'), ('New', 'York?')]
[('Does', 'John', 'Doe')]
[]
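The output above keeps the trailing "?" on "York?", whereas the question asks for "York". One possible way to handle that, sketched here as an assumption rather than as part of the original answer, is to strip leading and trailing punctuation from each token before grouping. The helper clean_words is hypothetical; string.punctuation and str.strip are from the standard library.

```python
import string
from itertools import groupby


def clean_words(text):
    """Split on whitespace and strip leading/trailing ASCII punctuation from each token."""
    stripped = (token.strip(string.punctuation) for token in text.split())
    # Drop tokens that consisted only of punctuation so that w[0] is always safe.
    return [w for w in stripped if w]


def upper_case_ngrams(words, n):
    """Group runs of capitalised words, then slide a window of length n over each run."""
    runs = [list(v) for is_cap, v in groupby(words, key=lambda w: w[0].isupper()) if is_cap]
    return [tuple(run[i:i + n]) for run in runs for i in range(len(run) - n + 1)]


words = clean_words("Does John Doe eat pizza in New York?")
pairs = upper_case_ngrams(words, 2)
print(pairs)  # [('Does', 'John'), ('John', 'Doe'), ('New', 'York')]
```

A side effect of this approach is that sentence boundaries disappear, so a capitalised word ending one sentence and the capitalised first word of the next would be treated as consecutive; whether that matters depends on the use case.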
Upvotes: 1