Reputation: 1054
Hi guys, I'm starting to study NLTK by following the official book from the NLTK team.
I'm in chapter 5, "Tagging", and I can't solve one of the exercises on page 186 of the PDF version:
Given the list of past participles specified by cfd2['VN'].keys(), try to collect a list of all the word-tag pairs that immediately precede items in that list.
I tried this way:
wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
[wsj[wsj.index((word,tag))-1:wsj.index((word,tag))+1] for (word,tag) in wsj if word in cfd2['VN'].keys()]
but it gives me this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 401, in iterate_from
for tok in piece.iterate_from(max(0, start_tok-offset)):
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 295, in iterate_from
self._stream.seek(filepos)
AttributeError: 'NoneType' object has no attribute 'seek'
I think I'm doing something wrong in accessing the wsj structure, but I can't figure out what is wrong!
Can you help me?
Thanks in advance!
Upvotes: 1
Views: 1961
Reputation: 721
wsj is of type ConcatenatedCorpusView, and I think it is choking on the tuple ('.', '.'). The easiest solution is to convert the ConcatenatedCorpusView to a list explicitly:
wsj = list(wsj)
Iteration works fine then. Getting the index of a duplicate item is a separate problem. See: https://gist.github.com/denten/11388676
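To illustrate the duplicate-item point, here is a minimal sketch that works on a plain list and uses enumerate() so that every occurrence keeps its own position. The sample data and the names tagged, participles and preceding_pairs are made up for illustration; they stand in for list(wsj) and cfd2['VN'].keys().

```python
# Hypothetical stand-in for list(wsj): a short list of (word, tag) pairs.
tagged = [
    ("The", "DET"), ("deal", "N"), ("was", "V"), ("approved", "VN"),
    ("and", "CNJ"), ("has", "V"), ("been", "VN"), ("approved", "VN"),
    (".", "."),
]

# Hypothetical stand-in for cfd2['VN'].keys(); a set gives fast lookups.
participles = {"approved", "been"}


def preceding_pairs(tagged_words, targets):
    """Return the (word, tag) pair immediately before each target word."""
    return [
        tagged_words[i - 1]
        for i, (word, _tag) in enumerate(tagged_words)
        if i > 0 and word in targets  # skip i == 0: nothing precedes it
    ]


pairs = preceding_pairs(tagged, participles)
```

Because the position comes from enumerate() rather than index(), the second "approved" is matched at its own position instead of being mapped back to the first one.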
Upvotes: 0
Reputation: 4410
wsj is of type nltk.corpus.reader.util.ConcatenatedCorpusView, which behaves like a list (this is why you can call methods like index() on it). Behind the scenes, however, NLTK never reads the whole list into memory; it reads only the parts it needs from a file object. It seems that if you iterate over a CorpusView object and call index() (which requires iterating again) at the same time, the file object ends up as None.
This way it works, though it is less elegant than a list comprehension:
for i in range(len(wsj)):
    if wsj[i][0] in cfd2['VN'].keys():
        print wsj[(i-1):(i+1)]
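If the problem really is re-reading the view while iterating over it, another option is a single forward pass that remembers the previous item, so nothing is ever looked up again. This is only a sketch under that assumption: pairs_before, sample and participles are hypothetical names, and the sample list stands in for wsj and cfd2['VN'].keys().

```python
def pairs_before(tagged_words, targets):
    """Yield the (word, tag) pair that immediately precedes each target word."""
    previous = None
    for pair in tagged_words:
        word = pair[0]
        if previous is not None and word in targets:
            yield previous
        previous = pair  # remember this pair for the next step


# Hypothetical sample data standing in for wsj and cfd2['VN'].keys().
sample = [("She", "PRO"), ("had", "V"), ("gone", "VN"), ("home", "N"),
          ("after", "P"), ("being", "V"), ("asked", "VN"), (".", ".")]
participles = {"gone", "asked"}

# Passing an iterator shows that only one forward pass is needed.
found = list(pairs_before(iter(sample), participles))
```

The function never calls index() or slices its input, so it should work on any iterable of (word, tag) pairs, not just lists.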
Upvotes: 2
Reputation: 1
Looks like both the index call and the slicing cause an exception:
wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
cfd2 = nltk.ConditionalFreqDist((t,w) for w,t in wsj)
wanted = cfd2['VN'].keys()

# just getting the index -> exception before 60 items
for w, t in wsj:
    if w in wanted:
        print wsj.index((w,t))

# just slicing -> sometimes finishes, sometimes throws exception
for i, (w,t) in enumerate(wsj):
    if w in wanted:
        print wsj[i-1:i+1]
I'm guessing it's caused by accessing previous items in a stream that you are iterating over.
It works fine if you iterate once over wsj to create a list of indices and use them in a second iteration to grab the slices:
results = [
    wsj[j-1:j+1]
    for j in [
        i for i, (w,t) in enumerate(wsj)
        if w in wanted
    ]
]
As a side note: calling index without a start argument will return the first match every time.
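A small sketch of that side note, shown on a plain Python list rather than on the corpus view (the words list and the all_positions helper are made up for illustration):

```python
words = ["a", "b", "a", "c", "a"]

# Without a start argument, index() reports the first match every time.
first_hits = [words.index(w) for w in words if w == "a"]


def all_positions(seq, item):
    """Collect every position of item by restarting index() after each hit."""
    positions = []
    start = 0
    while True:
        try:
            pos = seq.index(item, start)
        except ValueError:  # no further occurrence
            break
        positions.append(pos)
        start = pos + 1
    return positions


every_hit = all_positions(words, "a")
```

Here first_hits is [0, 0, 0] while every_hit is [0, 2, 4], which is why index-based lookups return the wrong neighbours for repeated words.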
Upvotes: 0