Oliver

Reputation: 441

Best method for creating Python Spacy NLP objects from a Pandas Series

I want to create spaCy nlp objects from 250k strings stored in a Pandas DataFrame column. Is there a way to optimize the following "apply" approach, i.e., can the call to the spaCy nlp object be vectorized?

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

df = pd.DataFrame({"id": [1, 2, 3], "text": ["this is a text", "another easy one", "oh you come on"]})

df["nlp"] = df.apply(lambda x: nlp(x.text), axis=1)

Upvotes: 1

Views: 1062

Answers (1)

thorntonc

Reputation: 2126

From my tests on a corpus of 29,071 strings, nlp.pipe is faster than apply:

import pandas as pd
import spacy
from time import time
from nltk.corpus import webtext

nlp = spacy.load("en_core_web_sm")  
texts = webtext.raw().split('\n')
df = pd.DataFrame({"text":texts})

#apply method
start = time()
df["nlp"] = df.apply(lambda x: nlp(x.text), axis=1)
end = time()
print(end - start)

# batch method
start = time()
df["nlp"] = list(nlp.pipe(df["text"]))
end = time()
print(end - start)

Output:

apply method: 209.74427151679993
batch method: 51.40181493759155
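If you only need part of the pipeline, nlp.pipe can be pushed further by disabling unused components and tuning the batch size. Below is a minimal sketch of that pattern; it uses spacy.blank("en") (tokenizer only, no model download needed) so it runs anywhere spaCy is installed, and the batch_size value is only an illustration, not a recommendation:

```python
import pandas as pd
import spacy

# Blank English pipeline: tokenizer only, no model download required.
# With a full model, spacy.load("en_core_web_sm", disable=["ner", "parser"])
# skips components you don't need and speeds up nlp.pipe further.
nlp = spacy.blank("en")

df = pd.DataFrame({"text": ["this is a text", "another easy one", "oh you come on"]})

# nlp.pipe accepts any iterable of strings; batch_size controls how many
# texts are buffered per batch (n_process=2 would add multiprocessing).
df["nlp"] = list(nlp.pipe(df["text"], batch_size=64))

print([len(doc) for doc in df["nlp"]])  # token count per Doc
```

Note that the returned objects are full spaCy Doc instances, so downstream token-level processing works the same as with the apply approach.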

Upvotes: 4
