Reputation: 895
I'm using a function from pycontractions to try to expand text into its grammatically accurate form. It works, but it's incredibly slow, and I can't help wondering whether I'm doing something unnecessary that contributes to the poor performance. For reference, it takes close to a minute to produce output.
from pycontractions import Contractions
def cont_expand(a):
    cont = Contractions(api_key="glove-twitter-100")
    expText = cont.expand_texts(a, precise=False)
    return expText
mystr = ["I'd like to have lunch today"]
x = list(cont_expand(mystr))
Upvotes: 1
Views: 1295
Reputation: 54173
I hadn't previously known about or worked with the pycontractions library, but from a quick look, I have a few ideas.
First and foremost, regarding your usage:
The Contractions object needs to load a large set of pre-existing word-vectors from disk to do some of its analysis. It also needs to instantiate another library, language-check, which apparently wraps a Java-based grammar-checking utility.
(Looking at the source code, I see that it actually does these initializations lazily, when first needed during expand_texts(), rather than during object initialization, when api_key='glove-twitter-100' was provided.)
On a small text like your probe value, that might be the largest contributor to runtime. So, a single expand_texts() on a just-initialized Contractions object won't be an accurate indicator of that object's performance on subsequent, similar texts. So, assuming your real usage will be on more than one text per Python invocation, you should create the Contractions object once and reuse it for every text, and leave its first, model-loading call out of any timing.
For example:
from pycontractions import Contractions
PYCNTRCTNS = Contractions(api_key="glove-twitter-100")
# dummy call to force vector/grammar loading
list(PYCNTRCTNS.expand_texts([])) # expect this to take a while; list() consumes the result in case it is lazy
def cont_expand(a):
    expText = PYCNTRCTNS.expand_texts(a, precise=False)
    return expText
mystr = ["I'd like to have lunch today"]
x = list(cont_expand(mystr)) # care about how long this takes
Other than that, your usage is pretty simple, and I don't see other things you can do, by calling that library differently, to speed things up.
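To see how much of the minute is one-time loading, it may help to time the first call and a later call separately. The following is only a sketch: LazyExpander is a hypothetical stand-in that fakes the lazy loading with a short sleep, so that the snippet runs without pycontractions installed. With the real library you would time list(PYCNTRCTNS.expand_texts(...)) in the same way.

```python
import time


class LazyExpander:
    """Hypothetical stand-in for an object that loads its models on first use."""

    def __init__(self, load_seconds=0.2):
        self.load_seconds = load_seconds
        self.loaded = False

    def _load(self):
        # Simulates the one-time cost of loading vectors and a grammar checker.
        if not self.loaded:
            time.sleep(self.load_seconds)
            self.loaded = True

    def expand_texts(self, texts):
        # Generator: nothing here runs until the result is iterated.
        self._load()
        for text in texts:
            yield text.replace("I'd", "I would")


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


expander = LazyExpander()
texts = ["I'd like to have lunch today"]

# list() forces the generator to run, so the work is included in the timing.
first, first_secs = timed(lambda t: list(expander.expand_texts(t)), texts)
second, second_secs = timed(lambda t: list(expander.expand_texts(t)), texts)
print(f"first call: {first_secs:.3f}s, second call: {second_secs:.3f}s")
```

If the second figure is far smaller than the first, most of the delay is initialization rather than the expansion itself.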
However, looking a little into how pycontractions works, I wouldn't be surprised that it's fairly slow, especially on large texts. The things it's doing internally are often fairly slow processes, and it's additionally doing them in ways that aren't heavily optimized – which may be perfectly fine, for simplicity of code, and especially on short texts, unless/until higher performance is needed.
For example, it describes using a "three-pass" approach.
The first pass involves a number of pattern-based replacements, for which the source code has hundreds of individual regular-expressions. Every text needs to be regular-expression-matched across these hundreds of expressions, in a loop, to perform this 1st step. (There are ways to optimize this to use fewer passes.)
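As a rough illustration of the "fewer passes" idea, unambiguous contractions can be handled with one compiled alternation plus a dictionary lookup, so each text is scanned once instead of once per pattern. This is a sketch of a general technique, not the library's code; the SIMPLE_EXPANSIONS table and the expand_simple name are made up for the example.

```python
import re

# Tiny illustrative table; hypothetical entries, not the library's own list.
SIMPLE_EXPANSIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

# One alternation for all keys. Longest keys go first so that a longer key
# is never shadowed by a shorter one that happens to be its prefix.
_KEYS = sorted(SIMPLE_EXPANSIONS, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _KEYS) + r")\b",
    re.IGNORECASE,
)


def expand_simple(text):
    """Replace every known contraction in a single scan of the text."""
    return _PATTERN.sub(lambda m: SIMPLE_EXPANSIONS[m.group(1).lower()], text)


print(expand_simple("I can't go, and it's late; don't wait."))
```

Note that this toy version always emits the lowercase expansion, so a sentence-initial "Don't" would lose its capital letter; a fuller version would need to restore case.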
For contractions that have multiple possible expansions – which includes the "I'd" in your test string – it performs each expansion & checks its grammar. Fortunately, this only involves a few expansions, but grammar-checking isn't the cheapest of operations, either.
For every alternate expansion, it calculates a word-vector-based semantic-difference measure called "Word Mover's Distance" from the original text, which can itself be quite expensive, especially on longer texts. (It's doing this from scratch for each alternate – even though, except for a couple of words, each alternate starts out identical – and even after it's found at least one grammatical option, it continues doing this calculation for non-grammatical candidates that have no chance of being chosen.)
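One way to avoid the wasted distance calculations would be to run the grammar check first and compute distances only for the candidates that survive it. The sketch below assumes that ordering; pick_expansion is a hypothetical helper, and toy_is_grammatical and toy_distance are crude placeholders standing in for the real grammar checker and Word Mover's Distance.

```python
def pick_expansion(original, candidates, is_grammatical, distance):
    """Choose the candidate closest to the original, preferring grammatical ones.

    `is_grammatical` and `distance` are caller-supplied callables.
    """
    grammatical = [c for c in candidates if is_grammatical(c)]
    # Fall back to every candidate only when none passes the grammar check.
    pool = grammatical or list(candidates)
    if len(pool) == 1:
        return pool[0]  # a single survivor needs no distance computation
    return min(pool, key=lambda c: distance(original, c))


distance_calls = []  # records each candidate the distance function is asked about


def toy_is_grammatical(text):
    # Placeholder rule: reject "I had like", accept everything else.
    return "I had like" not in text


def toy_distance(a, b):
    # Placeholder metric: number of words in b that do not appear in a.
    distance_calls.append(b)
    return len(set(b.split()) - set(a.split()))


original = "I'd like to have lunch today"
candidates = [
    "I would like to have lunch today",
    "I had like to have lunch today",
]
best = pick_expansion(original, candidates, toy_is_grammatical, toy_distance)
print(best, "| distance calls:", len(distance_calls))
```

Here only one candidate passes the placeholder grammar check, so the expensive step is skipped entirely.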
And at each step, it's keeping interim results as raw strings, so either the pycontractions code or the individual supporting libraries' code is repeatedly performing the same tokenization steps.
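Repeated tokenization of the same string is the kind of cost that memoization can remove. A minimal sketch, assuming a simple whitespace tokenizer (the tokenize and shared_token_count functions are hypothetical, and the real libraries tokenize differently):

```python
from functools import lru_cache

tokenize_calls = 0  # counts how often the tokenizer body actually runs


@lru_cache(maxsize=4096)
def tokenize(text):
    """Split text into a tuple of lowercase tokens, cached per distinct string."""
    global tokenize_calls
    tokenize_calls += 1
    return tuple(text.lower().split())


def shared_token_count(a, b):
    """Number of distinct tokens the two texts have in common."""
    return len(set(tokenize(a)) & set(tokenize(b)))


original = "I'd like to have lunch today"
candidates = [
    "I would like to have lunch today",
    "I had like to have lunch today",
]
overlaps = [shared_token_count(original, c) for c in candidates]
print(overlaps, "| tokenizer runs:", tokenize_calls)
```

The original text is compared against two candidates but tokenized only once, so the tokenizer body runs three times rather than four.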
So: if you were doing this in bulk, and fixes to the underlying library were in scope, there's likely a lot of room for micro-optimizations.
But I think for many casual uses, just being sure you aren't repeatedly paying the Contractions initialization-loading cost on each operation may be enough of an improvement.
Upvotes: 1