Reputation: 25
Question
I have a data frame with over 90,000 rows and a column ['text'] that contains the text of some news articles.
The texts are about 3,000 words long on average, and passing them through word_tokenize is very slow. What would be a more efficient way to do this?
from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize)
df.head()
Also, word_tokenize doesn't remove some punctuation and other characters that I don't want, so I created a function to filter them out using spaCy.
from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','\xa0']
def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        if ~token.is_punct or ~token.is_stop or ~(token.text.lower() in spanish_stopwords) or ~(token.text.lower() in otherCharacters) or ~(token.text.lower() in STOP_WORDS):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens
Is there any other, better method to do this?
For reference, nlp is loaded like this:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()
Thanks for reading my maybe noob 👨🏽‍💻 question 😀, have a nice day 🌻.
Upvotes: 1
Views: 1751
Reputation: 2565
To make spaCy faster when you only wish to tokenize, you can change:
nlp = es_core_news_sm.load()
To:
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])
A small explanation:
spaCy gives you a full language model which does not merely tokenize your sentence but also does parsing plus POS and NER tagging. Most of the computation time is actually spent on those other tasks (parse tree, POS, NER) and not on tokenization, which is a much 'lighter' task computationally.
But, as you can see, spaCy lets you use only what you actually need, and that saves you some time.
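For example, dropping the trimmed pipeline into the original word_tokenize call could look like this rough sketch (it assumes the texto column and the df from the question):
import spacy

# load the Spanish model but skip the expensive pipeline components
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])

# only the tokenizer runs per document now, so this is much cheaper
df['tokenized_text'] = df['texto'].apply(lambda text: [t.text for t in nlp(text)])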
Another thing: you can make your function more efficient by lowercasing each token only once and by adding your stop words to spaCy itself (and even if you don't want to do that, the fact that otherCharacters is a list and not a set is not very efficient).
I would also add this:
for w in stopwords.words('spanish'):
    nlp.vocab[w].is_stop = True
for w in otherCharacters:
    nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
    nlp.vocab[w].is_stop = True
and then:
for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
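Putting those pieces together, the question's tokenize function could be trimmed down to something like this (a sketch that assumes the stop-word registration above has already run, so is_stop already covers the NLTK stop words, spaCy's STOP_WORDS and otherCharacters):
def tokenize(phrase):
    sentence_tokens = []
    for token in nlp(phrase):
        # everything registered on nlp.vocab above is caught by token.is_stop
        if not token.is_punct and not token.is_stop:
            sentence_tokens.append(token.text.lower())  # lowercase only once
    return sentence_tokens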
Upvotes: 1
Reputation: 11484
If you are only tokenizing, use a blank model (which only contains a tokenizer) instead of es_core_news_sm:
nlp = spacy.blank("es")
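A rough sketch of using it on the question's texto column; nlp.pipe streams the texts in batches, which tends to be faster than calling nlp() row by row with .apply:
import spacy

nlp = spacy.blank("es")  # blank Spanish pipeline: tokenizer only

# stream all ~90,000 texts through the tokenizer in batches
df['tokenized_text'] = [
    [token.text.lower() for token in doc]
    for doc in nlp.pipe(df['texto'], batch_size=100)
]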
Upvotes: 2