Reputation: 1
I have imported an Excel file as a pandas DataFrame. This file consists of >4000 rows (documents) and 12 columns. I extracted the column 'Text' for NLP.
The text in the 'Text' column is in Dutch, so I'm using the spaCy model for Dutch, 'nl_core_news_lg':
import pandas as pd
import spacy

# Load the Dutch language model once
nlp = spacy.load('nl_core_news_lg')

df = pd.read_excel(*file path*)
text_article = df['Text']
I have preprocessed df['Text']: I removed digits and punctuation and converted the text to all lower case, resulting in the following column: text_article['lower'].
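Roughly, that preprocessing step looks like the following sketch (here text_article is kept as a DataFrame rather than a Series, so it can hold the extra columns used below):

# Keep 'Text' in a DataFrame so derived columns ('lower', 'tokens', ...) can be added
text_article = df[['Text']].copy()

# Lowercase, then strip punctuation and digits
text_article['lower'] = (
    text_article['Text']
    .str.lower()
    .str.replace(r'[^\w\s]|\d', '', regex=True)
)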
Next, I've tokenized the text.
import re

def tokenization(text):
    # Split on runs of non-word characters (note the escape: r'\W+', not 'W+')
    tokens = re.split(r'\W+', text)
    return tokens
text_article['tokens'] = text_article['lower'].apply(lambda x: nlp.tokenizer(x))
I now want to add a Part-Of-Speech (POS) tag to every token and then remove all tokens with the POS tag 'VERB'.
I've tried the following code.
text_article['final'] = text_article['tokens'].apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop or token.pos_ == 'VERB'))
This code does not produce an error, but when I print a document as an example (e.g. doc 42), the text still includes verbs.
print(text_article['final'][42])
I'm running out of ideas here and really hope somebody can help me out! Thanks in advance.
Upvotes: 0
Views: 428
Reputation: 70
Try if not token.is_stop and token.pos_ != 'VERB'
It's the same as if not (token.is_stop or token.pos_ == 'VERB'). Your current condition, if not token.is_stop or token.pos_ == 'VERB', keeps every token that is either a non-stop word or a verb, so verbs always pass the filter.
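Applied to your last line, the corrected filter would look like this (nlp.tokenizer returns Doc objects, so doc.text re-feeds the raw text through the full pipeline to get POS tags):

text_article['final'] = text_article['tokens'].apply(
    lambda doc: " ".join(
        token.lemma_ for token in nlp(doc.text)
        if not token.is_stop and token.pos_ != 'VERB'
    )
)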
Also, do you really need the 'tokens' column? Otherwise you should compute 'final' directly from 'lower', applying both tokenization and POS tagging in a single .apply() instead of creating a 'tokens' column first. Your code should run faster that way, for example:
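# Tokenize, tag, lemmatize and filter in one pass, straight from 'lower'
text_article['final'] = text_article['lower'].apply(
    lambda text: " ".join(
        token.lemma_ for token in nlp(text)
        if not token.is_stop and token.pos_ != 'VERB'
    )
)

If that is still slow on 4000+ documents, nlp.pipe(text_article['lower']) batches the texts through the pipeline, which is usually faster than calling nlp() once per row.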
Last thing: why do you use your own tokenization instead of spaCy's?
Upvotes: 1