Reputation: 1
I have imported an Excel file as a pandas DataFrame. This file consists of >4000 rows (documents) and 12 columns. I extracted the column 'Text' for NLP.
The text in the 'Text' column is in Dutch, so I'm using the spaCy model for Dutch, 'nl_core_news_lg':
import pandas as pd
import spacy

# Load the Dutch language model once
nlp = spacy.load('nl_core_news_lg')

df = pd.read_excel(*file path*)
text_article = df['Text']
I have preprocessed df['Text']: I removed digits and punctuation and converted the text to all lower case, resulting in the following column: text_article['lower'].
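Roughly, that preprocessing step looks like the following sketch (here text_article is kept as a DataFrame rather than a Series, so it can hold the extra columns used below):

# Keep 'Text' in a DataFrame so derived columns ('lower', 'tokens', ...) can be added
text_article = df[['Text']].copy()

# Lowercase, then strip punctuation and digits
text_article['lower'] = (
    text_article['Text']
    .str.lower()
    .str.replace(r'[^\w\s]|\d', '', regex=True)
)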
Next, I've tokenized the text.
import re

def tokenization(text):
    # Split on runs of non-word characters (note the escape: r'\W+', not 'W+')
    tokens = re.split(r'\W+', text)
    return tokens
text_article['tokens'] = text_article['lower'].apply(lambda x: nlp.tokenizer(x))
I now want to add a Part-Of-Speech (POS) tag to every token and then remove all tokens with the POS tag 'VERB'.
I've tried the following code.
text_article['final'] = text_article['tokens'].apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop or token.pos_ == 'VERB'))
This code does not produce an error, but when I print a document as an example (e.g. doc 42), the text still includes verbs.
print(text_article['final'][42])
I'm running out of ideas here and really hope somebody can help me out! Thanks in advance.
Upvotes: 0
Views: 428
Reputation: 70
Try if not token.is_stop and token.pos_ != 'VERB'
It's the same as if not (token.is_stop or token.pos_ == 'VERB'). Your current condition, if not token.is_stop or token.pos_ == 'VERB', keeps every token that is either a non-stop word or a verb, so verbs always pass the filter.
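Applied to your last line, the corrected filter would look like this (nlp.tokenizer returns Doc objects, so doc.text re-feeds the raw text through the full pipeline to get POS tags):

text_article['final'] = text_article['tokens'].apply(
    lambda doc: " ".join(
        token.lemma_ for token in nlp(doc.text)
        if not token.is_stop and token.pos_ != 'VERB'
    )
)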
Also, do you really need the 'tokens' column? Otherwise you should compute 'final' directly from 'lower', applying both tokenization and POS tagging in a single .apply() instead of creating a 'tokens' column first. Your code should run faster that way, for example:
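# Tokenize, tag, lemmatize and filter in one pass, straight from 'lower'
text_article['final'] = text_article['lower'].apply(
    lambda text: " ".join(
        token.lemma_ for token in nlp(text)
        if not token.is_stop and token.pos_ != 'VERB'
    )
)

If that is still slow on 4000+ documents, nlp.pipe(text_article['lower']) batches the texts through the pipeline, which is usually faster than calling nlp() once per row.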
Last thing: why do you use your own tokenization instead of spaCy's?
Upvotes: 1