Julie

Reputation: 77

Is the Stanza library very slow?

I have two pieces of code that count the number of sentences in a text file. The two options give different results, and Option 2 (Stanza) is very slow. Is Option 2 (Stanza) more accurate? How can I speed up Option 2 (Stanza)? Thanks a lot!

Option 1 (regular expression): The following code takes 2 seconds and the output is 1444.

import re

import requests
from bs4 import BeautifulSoup

# A sentence starts at a capital letter and runs to the first ".", "!" or "?".
# A period followed by a digit (as in 3.5) does not end the sentence.
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")

def identify_sentences(input_text: str):
    """Returns all sentences in the input text"""
    return sentence_regex.findall(input_text)

r = requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content = r.content.decode('utf8')
soup = BeautifulSoup(content, "html5lib")
text = soup.text

sentences = identify_sentences(text)
len(sentences)

Option 2 (Stanza): The following code takes 6 minutes and the output is 2481.

import requests
from bs4 import BeautifulSoup
import stanza

# Pipeline with tokenizer, part-of-speech tagger, and named entity recognizer
nlp = stanza.Pipeline(lang='en', processors='tokenize, pos, ner')
r = requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content = r.content.decode('utf8')
soup = BeautifulSoup(content, "html5lib")
text = soup.text

doc = nlp(text)
sentences = doc.sentences
len(sentences)

Upvotes: 1

Views: 1034

Answers (1)

Christopher Manning

Reputation: 9450

Two answers:

  1. If all you want to do is split text into sentences, then your pipeline should simply be nlp=stanza.Pipeline(lang='en', processors='tokenize'). That will be much faster than the pipeline you show, which also runs a part-of-speech tagger and a named entity recognizer.
  2. But, yes, running Stanza is far slower than matching against a single regex. There should be many places where Stanza splits the text differently and better, because exclamation marks, question marks, and especially periods often occur in the middle of English sentences (e.g., here!). You'll have to decide for yourself whether the better accuracy is worth it to you.
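
To illustrate both points, here is a minimal sketch rather than a definitive implementation. The regex is the one from the question. The function names `regex_sentences` and `stanza_sentence_count` and the sample string are made up for illustration. The Stanza function assumes the English models have already been fetched with `stanza.download('en')`, and it imports Stanza lazily so that the regex part runs even where Stanza is not installed.

```python
import re

# Regex from the question: a capital letter, then anything up to the
# first ".", "!" or "?" (a period followed by a digit does not count).
SENTENCE_REGEX = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")


def regex_sentences(text: str) -> list:
    """Return the spans that the question's regex treats as sentences."""
    return SENTENCE_REGEX.findall(text)


def stanza_sentence_count(text: str) -> int:
    """Count sentences with a tokenize-only Stanza pipeline.

    Assumption: stanza is installed and stanza.download('en') has been run.
    """
    import stanza  # imported here so the regex demo works without Stanza

    # Only the tokenizer (which also does sentence splitting) is loaded;
    # no part-of-speech tagger or named entity recognizer.
    nlp = stanza.Pipeline(lang="en", processors="tokenize")
    return len(nlp(text).sentences)


# Hypothetical sample: two sentences, with a period inside the first one.
sample = "Mr. Smith paid $3.50 for the book. He left."

# The regex stops at the period after "Mr", so it reports three pieces.
print(regex_sentences(sample))
# ['Mr.', 'Smith paid $3.50 for the book.', 'He left.']
```

On this sample the regex counts three "sentences" where a reader would count two, because the period in "Mr." is treated as a sentence end. Calling `stanza_sentence_count(sample)` runs the faster tokenize-only pipeline from point 1; how it handles a given abbreviation depends on the model, so check it on your own text.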

Upvotes: 1
