Reputation: 602
I'm trying to apply text preprocessing to a pandas column using spaCy. My goal is to clean the column and then use the cleaned version for further analysis alongside the other columns.
Data:
   category  content
0  business  Quarterly profits at US media giant TimeWarne...
1  business  The dollar has hit its highest level against ...
2  business  The owners of embattled Russian oil giant Yuk...
3  business  British Airways has blamed high fuel prices f...
4  business  Shares in UK drinks and food firm Allied Dome...
My preprocessing:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(str(df['content']))
new_corpus = [[words.lemma_ for words in docs if (not words.is_stop and not words.is_punct and not words.like_num)] for docs in doc]
corpus_clean = [[word.lower() for word in docu if (word.isalpha())] for docu in new_corpus]
Error:
TypeError: 'spacy.tokens.token.Token' object is not iterable
Upvotes: 2
Views: 589
Reputation: 316
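The problem is in how the column is passed to spaCy.
You wanted a list of the 'content' texts, but str(df['content']) turns the whole column into one string (the printed representation of the Series, index and truncated text included). nlp(...) therefore returns a single Doc instead of one Doc per row. Iterating over a Doc yields Token objects, so the inner loop (for words in docs) tries to iterate over a Token, which raises the TypeError.
Change this line: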
doc = nlp(str(df['content']))
To this:
doc = nlp.pipe(df['content'].tolist())
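Note that nlp.pipe returns a generator of Doc objects, so it can only be consumed once. Putting the pieces together, a minimal sketch of the whole preprocessing step might look like the following. The function names clean_docs and preprocess_column are hypothetical, and the sketch assumes the en_core_web_sm model from the question is installed and that every entry in the column is a string.

```python
def clean_docs(docs):
    """Turn an iterable of spaCy Doc objects into lists of clean lemmas.

    Same filters as in the question: drop stop words, punctuation and
    number-like tokens, keep only alphabetic lemmas, and lowercase them.
    """
    cleaned = []
    for doc in docs:
        lemmas = [
            token.lemma_.lower()
            for token in doc
            if not (token.is_stop or token.is_punct or token.like_num)
            and token.lemma_.isalpha()
        ]
        cleaned.append(lemmas)
    return cleaned


def preprocess_column(texts, model="en_core_web_sm"):
    """Hypothetical helper: run spaCy over a list of strings, one Doc per text."""
    import spacy  # imported here so clean_docs can be used without loading spaCy

    nlp = spacy.load(model)
    # nlp.pipe yields one Doc per input string (and is lazy, so consume it once)
    return clean_docs(nlp.pipe(texts))
```

With the DataFrame from the question, this could be used as df['content_clean'] = preprocess_column(df['content'].tolist()), which keeps one list of lemmas per row so the result lines up with the other columns.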
Upvotes: 1