Is the Stanza library very slow?

Question

I have two pieces of code that count the number of sentences in a text file. The two options give different results, and Option 2 (Stanza) is very slow. Is Option 2 (Stanza) more accurate? How can I speed up Option 2 (Stanza)? Thanks a lot!

Option 1 (regular expression): The following code takes 2 seconds and outputs 1444.

import re

import requests
from bs4 import BeautifulSoup

# A "sentence" starts with a capital letter and runs to the first ".", "!" or "?"
# that is not a decimal point followed by a digit.
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

def identify_sentences(input_text: str):
    """Returns all sentences in the input text"""
    return sentence_regex.findall(input_text)

# Download the filing and strip the markup.
r = requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content = r.content.decode('utf8')
soup = BeautifulSoup(content, "html5lib")
text = soup.text

sentences = identify_sentences(text)
print(len(sentences))
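
To make it easier to see why the counts can differ, here is a minimal sketch of what the Option 1 pattern does and does not count. The sample strings and the helper name regex_sentences are made up for illustration and are not taken from the filing.

```python
import re

# Same pattern as in Option 1.
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

def regex_sentences(text: str) -> list:
    """Return every span the Option 1 pattern treats as a sentence."""
    return sentence_regex.findall(text)

# Made-up sample strings (assumptions for illustration only).
samples = [
    "Revenue rose 3.5 percent. costs fell! Is that right?",
    "Mr. Smith arrived.",
    "ITEM 1 BUSINESS",
]

for sample in samples:
    found = regex_sentences(sample)
    print(len(found), found)
```

On these samples the pattern skips "costs fell!" because it starts with a lowercase letter, splits "Mr. Smith arrived." into "Mr." and "Smith arrived.", and finds nothing in the heading-like line that has no closing punctuation. So the regex count is not necessarily a reliable baseline for judging the Stanza count.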

Option 2 (Stanza): The following code takes 6 minutes and outputs 2481.

import requests
import stanza
from bs4 import BeautifulSoup

# English pipeline with tokenization, POS tagging, and NER.
nlp = stanza.Pipeline(lang='en', processors='tokenize, pos, ner')

# Download the filing and strip the markup.
r = requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content = r.content.decode('utf8')
soup = BeautifulSoup(content, "html5lib")
text = soup.text

doc = nlp(text)
sentences = doc.sentences
print(len(sentences))

