Reputation: 77
I have two sets of codes to count the number of sentences in one text file. The two options generate different results and Option 2(Stanza) is very slow. Is Option 2(Stanza) more accurate? How should I speedup Option 2(Stanza)? Thanks a lot!
Option 1 (Regular expression): The following codes takes 2 seconds and the output is 1444.
import requests
from bs4 import BeautifulSoup
import re
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
def identify_sentences(input_text:str):
"""Returns all sentences in the input text"""
sentences = re.findall(sentence_regex, input_text)
return sentences
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
sentences=identify_sentences(text)
len(sentences)
Option 2(Stanza): The following codes takes 6 minutes and the output is 2481.
import requests
from bs4 import BeautifulSoup
import stanza
nlp=stanza.Pipeline(lang='en', processors='tokenize, pos, ner')
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
doc=nlp(text)
sentences=doc.sentences
len(sentences)
Upvotes: 1
Views: 1034
Reputation: 9450
Two answers:
nlp=stanza.Pipeline(lang='en', processors='tokenize')
and that will be much faster than the pipeline you show that also runs a part-of-speech tagger and named entity recognizer.Upvotes: 1