Annick

Reputation: 1

How to remove tokens with POS tag 'VERB' from a dataframe

I have imported an Excel file as a pandas DataFrame. This file consists of >4000 rows (documents) and 12 columns. I extracted the column 'Text' for NLP.

The text in the column 'Text' is in Dutch. I'm using the spaCy model for Dutch, 'nl_core_news_lg'.

import spacy
import pandas as pd

# spacy.load returns the loaded pipeline; assigning it directly makes
# the separate "import nl_core_news_lg" unnecessary
nlp = spacy.load('nl_core_news_lg')

df = pd.read_excel(*file path*)
text_article = (df['Text'])

I have preprocessed df['Text']: I removed digits and punctuation, and converted the text to lower case, resulting in the variable text_article['lower'].
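A minimal sketch of that preprocessing step on a plain string (the exact regexes are an assumption; the question only describes the result):

```python
import re

def preprocess(text):
    """Lowercase the text and strip digits and punctuation."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)       # remove digits
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation
    return re.sub(r'\s+', ' ', text).strip()

print(preprocess("Dit is Tekst 42, met interpunctie!"))
# → dit is tekst met interpunctie
```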

Next, I've tokenized the text.

import re

def tokenization(text):
    # \W+ splits on runs of non-word characters
    # (the original 'W+' would split on the literal letter W)
    tokens = re.split(r'\W+', text)
    return tokens
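For reference, a quick check of what re.split(r'\W+', ...) produces; note that punctuation at the edges of the string yields empty tokens:

```python
import re

# Splitting on runs of non-word characters
print(re.split(r'\W+', 'dit is een zin'))
# → ['dit', 'is', 'een', 'zin']

# A trailing full stop leaves an empty string at the end
print(re.split(r'\W+', 'dit is een zin.'))
# → ['dit', 'is', 'een', 'zin', '']
```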

text_article['tokens'] = text_article['lower'].apply(lambda x: nlp.tokenizer(x)) 

I now want to add a Part-Of-Speech (POS) tag to every token, and then remove all tokens with the POS tag 'VERB'.

I've tried the following code.

text_article['final'] = text_article['tokens'].apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop or token.pos_ == 'VERB'))

This code does not produce an error, but when I print a document as an example (e.g. doc 42), the text still includes verbs.

print(text_article['final'][42])

I'm running out of ideas here and really hope somebody can help me out! Thanks in advance.

Upvotes: 0

Views: 428

Answers (1)

JulienBr

Reputation: 70

Try if not token.is_stop and token.pos_ != 'VERB'; it is equivalent to if not (token.is_stop or token.pos_ == 'VERB'). Your current condition, not token.is_stop or token.pos_ == 'VERB', keeps every verb: not binds only to token.is_stop, so any token with the tag 'VERB' passes the filter.
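A quick way to see the difference, using a hypothetical stand-in for spaCy tokens (a namedtuple here, just for illustration):

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, only to compare the two conditions.
Token = namedtuple('Token', ['text', 'is_stop', 'pos_'])

tokens = [
    Token('fiets', False, 'NOUN'),
    Token('rijdt', False, 'VERB'),
    Token('de', True, 'DET'),
]

# The question's condition: verbs are kept, because `not` binds only to is_stop.
buggy = [t.text for t in tokens if not t.is_stop or t.pos_ == 'VERB']
print(buggy)   # → ['fiets', 'rijdt']

# The corrected condition: drops both stop words and verbs.
fixed = [t.text for t in tokens if not t.is_stop and t.pos_ != 'VERB']
print(fixed)   # → ['fiets']
```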

Also, do you really need the 'tokens' column? If not, you could compute 'final' directly from 'lower', applying tokenization and POS tagging in a single .apply() without creating a 'tokens' column. Your code should run faster that way.
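A sketch of that single-pass shape in pandas, with a trivial stand-in for the spaCy pipeline (in the real code, process would call nlp(text) and build the filtered, lemmatized string):

```python
import pandas as pd

def process(text):
    # Stand-in for the real work; in the question this would be:
    # " ".join(token.lemma_ for token in nlp(text)
    #          if not token.is_stop and token.pos_ != 'VERB')
    return text.upper()

df = pd.DataFrame({'lower': ['dit is een zin', 'nog een zin']})

# One pass from 'lower' to 'final'; no intermediate 'tokens' column needed.
df['final'] = df['lower'].apply(process)
print(df['final'].tolist())   # → ['DIT IS EEN ZIN', 'NOG EEN ZIN']
```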

Last thing: why do you use your own tokenization instead of spaCy's?

Upvotes: 1
