Reubend

Reputation: 664

spaCy Classifier: 'unicode' object has no attribute 'to_array'

I'm trying to code a minimal text classifier with spaCy. I wrote the following snippet of code to train just the text categorizer (without training the whole NLP pipeline):

import spacy
from spacy.pipeline import TextCategorizer
nlp = spacy.load('en')

doc1 = u'This is my first document in the dataset.'
doc2 = u'This is my second document in the dataset.'

gold1 = u'Category1'
gold2 = u'Category2'

textcat = TextCategorizer(nlp.vocab)
textcat.add_label('Category1')
textcat.add_label('Category2')
losses = {}
optimizer = textcat.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

But when I run it, it raises an error. Here is the traceback:

Traceback (most recent call last):
  File "C:\Users\Reuben\Desktop\Classification\Classification\Training.py", line 16, in <module>
    textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
  File "pipeline.pyx", line 838, in spacy.pipeline.TextCategorizer.update
  File "D:\Program Files\Anaconda2\lib\site-packages\thinc\api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "D:\Program Files\Anaconda2\lib\site-packages\thinc\api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "D:\Program Files\Anaconda2\lib\site-packages\thinc\api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "D:\Program Files\Anaconda2\lib\site-packages\thinc\api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "D:\Program Files\Anaconda2\lib\site-packages\spacy\_ml.py", line 95, in _preprocess_doc
    keys = [doc.to_array(LOWER) for doc in docs]
AttributeError: 'unicode' object has no attribute 'to_array'

How can I fix this?

Upvotes: 1

Views: 722

Answers (1)

Reubend

Reputation: 664

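Apparently textcat.update expects Doc objects (created by calling nlp on each text) and gold values that were made with GoldParse, not plain strings. The traceback shows to_array being called on each item in the docs list, which fails on a unicode string. The working version looks like this: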

import spacy
from spacy.pipeline import TextCategorizer
from spacy.gold import GoldParse
nlp = spacy.load('en')

doc1 = nlp(u'This is my first document in the dataset.')
doc2 = nlp(u'This is my second document in the dataset.')

gold1 = GoldParse(doc=doc1, cats={'Category1': 1, 'Category2': 0})
gold2 = GoldParse(doc=doc2, cats={'Category1': 0, 'Category2': 1})

textcat = TextCategorizer(nlp.vocab)
textcat.add_label('Category1')
textcat.add_label('Category2')
losses = {}
optimizer = textcat.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
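
With more than two categories, writing each cats dictionary by hand gets error-prone. Here is a minimal sketch of a helper that builds a dictionary in the same shape as the ones passed to GoldParse above. The name one_hot_cats is hypothetical and not part of spaCy; it is plain Python and only assumes that each document has exactly one correct label.

```python
def one_hot_cats(label, all_labels):
    """Return a dict mapping the given label to 1 and every other label to 0.

    Hypothetical helper, not a spaCy function.
    """
    if label not in all_labels:
        raise ValueError("Unknown label: {!r}".format(label))
    return {name: int(name == label) for name in all_labels}


labels = ['Category1', 'Category2']

# Same shape as the hand-written cats dictionaries in the answer.
cats1 = one_hot_cats('Category1', labels)
cats2 = one_hot_cats('Category2', labels)

print(cats1)
print(cats2)
```

The results could then be used as GoldParse(doc=doc1, cats=cats1) and GoldParse(doc=doc2, cats=cats2) in place of the literal dictionaries.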

Thanks to @abarnert in the comments for helping me debug this.

Upvotes: 1
