Reputation: 1021
I chose spaCy to process text because of the performance of its lemmatization compared with NLTK's. But when I process millions of short texts, it always consumes all of my memory (32 GB) and crashes. Without spaCy, the same job takes just a few minutes and uses less than 10 GB of memory.
Is something wrong with how I am using this method? Is there a better way to improve the performance? Thanks!
# imports and globals below are presumed from the rest of the script
import re
import spacy
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nlp = spacy.load('en')
stop_words = set(stopwords.words('english'))

def tokenizer(text):
    try:
        # NLTK sentence + word tokenization
        tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
        # drop stopwords, punctuation, and short tokens
        tokens = list(filter(lambda t: t.lower() not in stop_words, tokens))
        tokens = list(filter(lambda t: t not in punctuation, tokens))
        tokens = list(filter(lambda t: len(t) > 4, tokens))
        # keep only tokens containing at least one letter
        filtered_tokens = []
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        # re-join and run the full spaCy pipeline just for the lemmas
        spacy_parsed = nlp(' '.join(filtered_tokens))
        filtered_tokens = [token.lemma_ for token in spacy_parsed]
        return filtered_tokens
    except Exception as e:
        raise e
Dask parallel computing
import dask.dataframe as dd
from dask.multiprocessing import get  # assumed; `get` may also come from dask.threaded

ddata = dd.from_pandas(res, npartitions=50)

def dask_tokenizer(df):
    df['text_token'] = df['text'].map(tokenizer)
    return df

%time res_final = ddata.map_partitions(dask_tokenizer).compute(get=get)
Info about spaCy
spaCy version: 2.0.5
Location: /opt/conda/lib/python3.6/site-packages/spacy
Platform: Linux-4.4.0-103-generic-x86_64-with-debian-stretch-sid
Python version: 3.6.3
Models: en, en_default
Upvotes: 8
Views: 8832
Reputation: 803
This answer is based on pmbaumgartner's answer; thanks pmbaumgartner. I just added POS tags to filter the vocabulary (needed for some text-analytics tasks):
allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']  # or any other types

def token_filter(token):
    return (token.pos_ in allowed_postags) and not (
        token.is_punct or token.is_space or token.is_stop or len(token.text) <= 2
    )
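For completeness, here is a minimal sketch of how this filter might be plugged into nlp.pipe; the 'en' model, the example sentences, and the variable names are assumptions for illustration, mirroring pmbaumgartner's answer:

import spacy

nlp = spacy.load('en')  # assumes the English model is installed
docs = ["Millions of short texts would normally go here.",
        "The quick brown fox jumps over the lazy dog."]

allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']

def token_filter(token):
    # keep content words only: no punctuation, whitespace, stopwords, or very short tokens
    return (token.pos_ in allowed_postags) and not (
        token.is_punct or token.is_space or token.is_stop or len(token.text) <= 2
    )

# one list of lemmas per document
filtered_tokens = [[token.lemma_ for token in doc if token_filter(token)]
                   for doc in nlp.pipe(docs)]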
Upvotes: -1
Reputation: 712
You can use multi-threading in spaCy to create a fast tokenization and data ingestion pipeline. Rewriting your code block and functionality using the nlp.pipe method would look something like this:
import spacy

nlp = spacy.load('en')
docs = df['text'].tolist()

def token_filter(token):
    # keep a token only if it is not punctuation, not whitespace,
    # not a stopword, and longer than 4 characters
    return not (token.is_punct or token.is_space or token.is_stop or len(token.text) <= 4)

filtered_tokens = []
for doc in nlp.pipe(docs):
    tokens = [token.lemma_ for token in doc if token_filter(token)]
    filtered_tokens.append(tokens)
This puts all of your filtering into the token_filter function, which takes a spaCy token and returns True only if the token is not punctuation, not a space, not a stopword, and longer than 4 characters. You then apply this function to each token of each document, keeping the lemma only when the token passes all of those checks. filtered_tokens ends up as a list of your tokenized documents.
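If memory is still an issue, nlp.pipe also accepts a batch_size argument that controls how many texts are buffered at a time. A rough follow-up sketch, continuing from the code above (the batch size of 500 is just a placeholder, not a recommendation):

# stream texts in smaller batches so only a bounded number of Docs
# is held in memory at once
filtered_tokens = []
for doc in nlp.pipe(docs, batch_size=500):
    filtered_tokens.append([token.lemma_ for token in doc if token_filter(token)])

# attach the token lists back onto the original dataframe
df['text_token'] = filtered_tokens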
The spaCy documentation on nlp.pipe and language processing pipelines is a helpful reference for customizing this pipeline.
Upvotes: 12
Reputation: 1078
You should filter out tokens after parsing. That way the trained model will give better tagging (unless it was trained on text filtered in a similar way, which is unlikely). Filtering afterwards also makes it possible to use nlp.pipe, which is said to be fast. See the nlp.pipe example at http://spacy.io/usage/spacy-101#lightning-tour-multi-threaded.
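A minimal sketch of this filter-after-parsing idea (the model name, example texts, and the particular token attributes used for filtering are assumptions for illustration):

import spacy

nlp = spacy.load('en')
texts = ["This is the first raw, unfiltered document.",
         "And here is another short text."]

lemmas = []
for doc in nlp.pipe(texts):
    # the tagger sees the original, unmodified text; filtering happens only afterwards
    lemmas.append([token.lemma_ for token in doc
                   if not (token.is_punct or token.is_space or token.is_stop)])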
Upvotes: 2