J. Johnson

Reputation: 21

Python SpaCy Create nlp Document - Argument 'string' has incorrect type

I'm relatively new to Python NLP and I'm trying to process a CSV file with spaCy. I can load the file fine with Pandas, but when I attempt to process the comments column with spaCy's nlp pipeline, it raises a TypeError roughly 5% of the way through the file's contents.

Code block follows:

import pandas as pd
df = pd.read_csv('./reviews.washington.dc.csv')

import spacy
nlp = spacy.load('en')

for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1, n_threads=4):
    print(parsed_doc.text)

I've also tried:

df['parsed'] = df['comments'].apply(nlp)

with the same result.

The traceback I'm receiving is:

Traceback (most recent call last):
  File "/Users/john/Downloads/spacy_load.py", line 11, in <module>
    for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1, n_threads=4):
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 352, in pipe
    for doc in stream:
  File "spacy/syntax/parser.pyx", line 239, in pipe (spacy/syntax/parser.cpp:8912)
  File "spacy/matcher.pyx", line 465, in pipe (spacy/matcher.cpp:9904)
  File "spacy/syntax/parser.pyx", line 239, in pipe (spacy/syntax/parser.cpp:8912)
  File "spacy/tagger.pyx", line 231, in pipe (spacy/tagger.cpp:6548)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 345, in <genexpr>
    stream = (self.make_doc(text) for text in texts)
  File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 293, in <lambda>
    self.make_doc = lambda text: self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)

Can anyone shed some light on why this is happening, and how I might work around it? I've tried various workarounds from this site to no avail; try/except blocks have had no effect either.

Upvotes: 2

Views: 2975

Answers (1)

QA Collective

Reputation: 2439

I've just been experiencing a very similar error to the one you received.

>>> c.add_texts(df.DetailedDescription.astype('object'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\textacy\corpus.py", line 297, in add_texts
    for i, spacy_doc in enumerate(spacy_docs):
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 554, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 369, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "nn_parser.pyx", line 369, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "pipeline.pyx", line 395, in pipe
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
    for item in self.iterseq:
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 534, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site-packages\spacy\language.py", line 357, in make_doc
    return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)

Finally, I happened across a solution: use the Pandas data frame to cast the values to Unicode, then retrieve them as a native array and feed that into the add_texts method of the Textacy Corpus object.

c = textacy.corpus.Corpus(lang='en_core_web_lg')
c.add_texts(df.DetailedDescription.astype('unicode').values)

Doing this allowed me to add all the texts to my corpus, despite my having tried to forcefully load a Unicode-compliant file beforehand (snippet included below in case it helps others).

import re
import codecs
import pandas as pd
from io import StringIO

# Strip control and non-ASCII bytes before parsing, and keep the
# description columns as objects so empty cells become NaN explicitly
with codecs.open('Base Data\Base Data.csv', 'r', encoding='utf-8', errors='replace') as base_data:
  df = pd.read_csv(StringIO(re.sub(r'(?!\n)[\x00-\x1F\x80-\xFF]', '', base_data.read())),
                   dtype={"DetailedDescription": object, "OtherDescription": object},
                   na_values=[''])
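For context, the root cause in both tracebacks is the same: pandas represents empty CSV cells as NaN, which is a float, while spaCy's tokenizer only accepts strings. A minimal sketch of the failure and the fix, using a made-up two-column CSV and astype(str) (the Python 3 equivalent of astype('unicode')):

```python
import io
import pandas as pd

# A CSV with an empty cell: pandas reads the missing value in as NaN,
# which is a float, not a str
csv_data = "id,comments\n1,Great place!\n2,\n3,Would stay again.\n"
df = pd.read_csv(io.StringIO(csv_data))
print(type(df["comments"][1]))  # <class 'float'> (NaN)

# Casting the column to str turns NaN into the string 'nan', so every
# value handed to spaCy's tokenizer is a genuine string
texts = df["comments"].astype(str).values
print(all(isinstance(t, str) for t in texts))  # True
```

Depending on your data, you may instead prefer df.dropna(subset=['comments']) or df['comments'].fillna('') so that the literal string 'nan' doesn't end up in your corpus.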

Upvotes: 1
