Reputation: 25
Question
I have a data frame with over 90,000 rows and a column ['text'] that contains the text of some news articles.
The texts are about 3,000 words long on average, and passing them through word_tokenize is very slow. What would be a more efficient way to do this?
from nltk.tokenize import word_tokenize
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize)
df.head()
Also, word_tokenize doesn't remove some punctuation and other characters that I don't want, so I created a function to filter them out using spaCy.
from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords
spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`','�',' ','\xa0']
def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        if ~token.is_punct or ~token.is_stop or ~(token.text.lower() in spanish_stopwords) or ~(token.text.lower() in otherCharacters) or ~(token.text.lower() in STOP_WORDS):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens
Is there any other, better method to do this?
For reference, nlp is loaded like this:
import spacy
import es_core_news_sm
nlp = es_core_news_sm.load()
Thanks for reading my maybe noob 👨🏽‍💻 question 😀, have a nice day 🌻.
Upvotes: 1
Views: 1751
Reputation: 2565
To make spaCy faster when you only wish to tokenize, you can change:
nlp = es_core_news_sm.load()
To:
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])
A small explanation:
spaCy gives you a full language model which does not merely tokenize your sentence but also does parsing plus POS and NER tagging. Most of the computation time is actually spent on those other tasks (parse tree, POS, NER) and not on tokenization, which is a much 'lighter' task computationally.
But, as you can see, spaCy lets you use only what you actually need, and that saves you some time.
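For example, dropping the trimmed pipeline into the original word_tokenize call could look like this rough sketch (it assumes the texto column and the df from the question):
import spacy

# load the Spanish model but skip the expensive pipeline components
nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])

# only the tokenizer runs per document now, so this is much cheaper
df['tokenized_text'] = df['texto'].apply(lambda text: [t.text for t in nlp(text)])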
Another thing: you can make your function more efficient by lowercasing each token only once and by adding your stop words to spaCy itself (and even if you don't want to do that, the fact that otherCharacters is a list and not a set is not very efficient).
I would also add this:
for w in stopwords.words('spanish'):
    nlp.vocab[w].is_stop = True
for w in otherCharacters:
    nlp.vocab[w].is_stop = True
for w in STOP_WORDS:
    nlp.vocab[w].is_stop = True
and then:
for token in tokenized_phrase:
    if not token.is_punct and not token.is_stop:
        sentence_tokens.append(token.text.lower())
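Putting those pieces together, the question's tokenize function could be trimmed down to something like this (a sketch that assumes the stop-word registration above has already run, so is_stop already covers the NLTK stop words, spaCy's STOP_WORDS and otherCharacters):
def tokenize(phrase):
    sentence_tokens = []
    for token in nlp(phrase):
        # everything registered on nlp.vocab above is caught by token.is_stop
        if not token.is_punct and not token.is_stop:
            sentence_tokens.append(token.text.lower())  # lowercase only once
    return sentence_tokens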
Upvotes: 1
Reputation: 11484
If you are only tokenizing, use a blank model (which only contains a tokenizer) instead of es_core_news_sm:
nlp = spacy.blank("es")
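A rough sketch of using it on the question's texto column; nlp.pipe streams the texts in batches, which tends to be faster than calling nlp() row by row with .apply:
import spacy

nlp = spacy.blank("es")  # blank Spanish pipeline: tokenizer only

# stream all ~90,000 texts through the tokenizer in batches
df['tokenized_text'] = [
    [token.text.lower() for token in doc]
    for doc in nlp.pipe(df['texto'], batch_size=100)
]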
Upvotes: 2