Reputation: 21
I'm relatively new to Python NLP and I'm trying to process a CSV file with spaCy. I can load the file just fine with pandas, but when I attempt to process it with spaCy's nlp function, the interpreter raises an error roughly 5% of the way through the file's contents.
Code block follows:
import pandas as pd
df = pd.read_csv('./reviews.washington.dc.csv')
import spacy
nlp = spacy.load('en')
for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1, n_threads=4):
    print(parsed_doc.text)
I've also tried:
df['parsed'] = df['comments'].apply(nlp)
with the same result.
The traceback I'm receiving is:
Traceback (most recent call last):
File "/Users/john/Downloads/spacy_load.py", line 11, in <module>
for parsed_doc in nlp.pipe(iter(df['comments']), batch_size=1,
n_threads=4):
File "/usr/local/lib/python3.6/site-packages/spacy/language.py",
line 352, in pipe for doc in stream:
File "spacy/syntax/parser.pyx", line 239, in pipe
(spacy/syntax/parser.cpp:8912)
File "spacy/matcher.pyx", line 465, in pipe (spacy/matcher.cpp:9904)
File "spacy/syntax/parser.pyx", line 239, in pipe (spacy/syntax/parser.cpp:8912)
File "spacy/tagger.pyx", line 231, in pipe (spacy/tagger.cpp:6548)
File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 345,
in <genexpr> stream = (self.make_doc(text) for text in texts)
File "/usr/local/lib/python3.6/site-packages/spacy/language.py", line 293,
in <lambda> self.make_doc = lambda text: self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)
Can anyone shed some light on why this is happening, and how I might work around it? I've tried various workarounds from the site to no avail; try/except blocks have had no effect either.
Upvotes: 2
Views: 2975
Reputation: 2439
I just ran into a very similar error to the one you received.
>>> c.add_texts(df.DetailedDescription.astype('object'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site
-packages\textacy\corpus.py", line 297, in add_texts
for i, spacy_doc in enumerate(spacy_docs):
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site
-packages\spacy\language.py", line 554, in pipe
for doc in docs:
File "nn_parser.pyx", line 369, in pipe
File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__
next__ (cytoolz/itertoolz.c:14538)
for item in self.iterseq:
File "nn_parser.pyx", line 369, in pipe
File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__
next__ (cytoolz/itertoolz.c:14538)
for item in self.iterseq:
File "pipeline.pyx", line 395, in pipe
File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__
next__ (cytoolz/itertoolz.c:14538)
for item in self.iterseq:
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site
-packages\spacy\language.py", line 534, in <genexpr>
docs = (self.make_doc(text) for text in texts)
File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python36\lib\site
-packages\spacy\language.py", line 357, in make_doc
return self.tokenizer(text)
TypeError: Argument 'string' has incorrect type (expected str, got float)
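For context, here is where the float comes from (a minimal sketch with a stand-in frame, not my real data): pandas represents empty CSV cells as float NaN, which is exactly the type the tokenizer rejects.

```python
import pandas as pd
import numpy as np

# Stand-in for the real DetailedDescription column: one cell is empty.
df = pd.DataFrame({"DetailedDescription": ["first item", np.nan, "third item"]})

# The missing cell is a float NaN, not a string.
print(df["DetailedDescription"].map(type).tolist())
# [<class 'str'>, <class 'float'>, <class 'str'>]
```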
Finally, I happened across a solution: use the pandas DataFrame to cast the values to Unicode, retrieve them as a native array, and feed that into the add_texts method of the textacy Corpus object.
c = textacy.corpus.Corpus(lang='en_core_web_lg')
c.add_texts(df.DetailedDescription.astype('unicode').values)
Doing this allowed me to add all texts to my corpus, even though I had already tried to force-load a Unicode-compliant file beforehand (snippet included below in case it helps others).
import codecs
import re
from io import StringIO
import pandas as pd

# Strip control and non-ASCII bytes (except newlines) before parsing.
with codecs.open(r'Base Data\Base Data.csv', 'r', encoding='utf-8', errors='replace') as base_data:
    df = pd.read_csv(StringIO(re.sub(r'(?!\n)[\x00-\x1F\x80-\xFF]', '', base_data.read())),
                     dtype={"DetailedDescription": object, "OtherDescription": object},
                     na_values=[''])
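The same idea works without textacy, in case anyone hits this with plain pandas + spaCy: casting the column to str turns NaN into the literal string 'nan', or you can drop the missing rows entirely before piping into nlp. A sketch with a made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"comments": ["great location", np.nan, "clean room"]})

# Option 1: cast everything to str (the NaN becomes the string 'nan').
texts = df["comments"].astype(str).values
print(all(isinstance(t, str) for t in texts))  # True

# Option 2: drop the missing rows so no 'nan' strings reach the pipeline.
clean = df["comments"].dropna().tolist()
print(len(clean))  # 2
```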
Upvotes: 1