Reputation: 28437
To keep my study comparable, I am working with data that has already been tokenised (not with spaCy). I need to use these tokens as input to ensure that I work with the same data across the board. I would like to feed these tokens into spaCy's tagger, but the following fails:
import spacy
nlp = spacy.load('en', disable=['tokenizer', 'parser', 'ner', 'textcat'])
sent = ['I', 'like', 'yellow', 'bananas']
doc = nlp(sent)
for i in doc:
    print(i)
with the following traceback:
Traceback (most recent call last):
  File "C:/Users/bmvroy/.PyCharm2018.2/config/scratches/scratch_6.py", line 6, in <module>
    doc = nlp(sent)
  File "C:\Users\bmvroy\venv\lib\site-packages\spacy\language.py", line 346, in __call__
    doc = self.make_doc(text)
  File "C:\Users\bmvroy\venv\lib\site-packages\spacy\language.py", line 378, in make_doc
    return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got list)
First of all, I'm not sure why spaCy tries to tokenize the input at all, since I disabled the tokenizer in the load() call. Second, this is evidently not the way to go.
I am looking for a way to feed the tagger a list of tokens. Is that possible with spaCy?
I tried the solution provided by @aab, combined with information from the documentation, but to no avail:
from spacy.tokens import Doc
from spacy.lang.en import English
from spacy.pipeline import Tagger
nlp = English()
tagger = Tagger(nlp.vocab)
words = ['Listen', 'up', '.']
spaces = [True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
processed = tagger(doc)
print(processed)
This code didn't run either; it failed with the following error:
    processed = tagger(doc)
  File "pipeline.pyx", line 426, in spacy.pipeline.Tagger.__call__
  File "pipeline.pyx", line 438, in spacy.pipeline.Tagger.predict
AttributeError: 'bool' object has no attribute 'tok2vec'
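If I understand the error correctly, Tagger(nlp.vocab) builds a component without a trained model (the model argument seems to default to the placeholder True, which would explain the 'bool' object message). A sketch of what I would try instead, pulling the trained tagger out of a loaded pipeline via get_pipe() (assuming the 'en' model is installed; untested):
import spacy
from spacy.tokens import Doc

nlp = spacy.load('en')
# Use the tagger that already has trained weights, not a bare Tagger(vocab)
tagger = nlp.get_pipe('tagger')

doc = Doc(nlp.vocab, words=['Listen', 'up', '.'], spaces=[True, False, False])
tagger(doc)

for t in doc:
    print(t.text, t.tag_)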
Upvotes: 2
Views: 1474
Reputation: 11474
You need to use the alternative way of constructing a document: directly, via the Doc class. Here's the example from their docs (https://spacy.io/api/doc):
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'], spaces=[True, False, False])
The spaces argument (whether each token is followed by a space) is optional.
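A minimal sketch (using a blank English pipeline) of how spaces affects the reconstructed doc.text:
from spacy.tokens import Doc
from spacy.lang.en import English

nlp = English()

# With explicit spaces: 'hello world!'
doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'], spaces=[True, False, False])
print(doc.text)

# Without spaces, every token is assumed to be followed by one: 'hello world ! '
doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'])
print(doc.text)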
Then you can run the components you need, so the whole thing would look like:
import spacy
from spacy.tokens import Doc
nlp = spacy.load('en')
doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'], spaces=[True, False, False])
nlp.tagger(doc)
nlp.parser(doc)
for t in doc:
    print(t.text, t.pos_, t.tag_, t.dep_, t.head)
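If you'd rather keep calling nlp() on your data so the whole pipeline runs in one go, another option is to replace nlp.tokenizer with a custom class that builds the Doc itself, along the lines of the whitespace-tokenizer example in spaCy's docs. A sketch, assuming your tokens never contain spaces so they survive a round trip through a space-joined string:
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Build a Doc directly from whitespace-separated tokens,
    bypassing spaCy's own tokenization."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        return Doc(self.vocab, words=words)

nlp = spacy.load('en')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

# Pre-tokenised input: join on spaces so the custom tokenizer
# recovers exactly the same tokens.
tokens = ['I', 'like', 'yellow', 'bananas']
doc = nlp(' '.join(tokens))
for t in doc:
    print(t.text, t.pos_, t.tag_)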
Upvotes: 3