Reputation: 537
I'm looking for a way to improve the performance of a parser I built with pyparsing. I read up on packrat parsing and it seemed like it could really help my parser's performance. However, when I enabled packrat parsing, the performance got worse! Without packrat, parsing a 20 MB file takes about 2 minutes; with packrat enabled, it takes 2-3 times longer. I read that packrat can have problems with parse actions, so I removed all parse actions from my grammar to see whether packrat would improve performance then, but that didn't help either. I tried different cache size limits (unlimited, and values in the range of 100-1000), but every one of these settings made my parser slower once packrat was enabled.
Is there a rule of thumb for setting packrat's cache_size_limit? And are there grammar constructs that limit the usefulness of packrat parsing, or that would explain why my parser got slower with it?
Upvotes: 2
Views: 624
Reputation: 63709
A 20MB file is going to take some time with pyparsing no matter what, but there are some constructs that might be slowing you down.
When we redid the implementation of packrat parsing to use an OrderedDict for the cache, I did some testing of different values for the cache size. The default value of 128 turned out to be only about 1-2% less efficient than an unbounded cache, with huge wins in both memory and performance. So I'd be surprised if values above 200 or 300 would add much caching benefit, while still paying penalties in cache searching and memory management.
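For reference, here is a minimal sketch of enabling packrat and setting the cache size (the camelCase spelling is the pyparsing 2.x API; passing None requests an unbounded cache):

from pyparsing import ParserElement

# default bounded cache of 128 entries - usually within 1-2% of unbounded
ParserElement.enablePackrat()

# or choose an explicit limit; None requests an unbounded cache
# ParserElement.enablePackrat(cache_size_limit=256)
# ParserElement.enablePackrat(cache_size_limit=None)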
Packrat parsing works best when you get cache hits by revisiting the same expression at the same location of the input string. It's important that it is the exact same expression, not just an equivalent one. For instance, if you had something like:
Literal("A") + Optional("," + Literal("B")) + Optional("," + Literal("C"))
the separate Literals for the delimiting commas are different expressions, and so would not be cache hits. If instead you define and reuse a single expression for common punctuation, you have a better chance at packrat parsing being able to lookup its parsed results from the cache instead of reparsing:
COMMA = Literal(",")
Literal("A") + Optional(COMMA + Literal("B")) + Optional(COMMA + Literal("C"))
Lately I've been using the expr() syntax to create copies of expr, so that modifiers like parse actions, whitespace settings, etc. don't get accidentally changed globally. This works against expression reuse, though, so if you have instances of expr() that are not required, just reuse the base expr. Note also that expressions with results names make implicit copies, so be sure not to overuse results names.
Packrat parsing works best when using infixNotation, especially when there are 6 or more levels of operator precedence. The expressions that infixNotation generates reuse subexpressions rather than defining new ones, so it is able to get better cache performance.
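A minimal sketch of such a grammar (the operand and operator set here are placeholders, not from your parser):

from pyparsing import infixNotation, opAssoc, oneOf, pyparsing_common as ppc

operand = ppc.number
# each precedence level is built on top of the level below it, so the
# same subexpressions get retried at the same locations - ideal for packrat
arith = infixNotation(operand, [
    ("-", 1, opAssoc.RIGHT),         # unary minus
    (oneOf("* /"), 2, opAssoc.LEFT),
    (oneOf("+ -"), 2, opAssoc.LEFT),
])
print(arith.parseString("1 + 2 * -3"))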
You may be overusing the '^' operator (Or) where you could use the '|' operator (MatchFirst). This can be especially expensive when used in a recursive expression (one using Forward). Compare these two expressions:
arith_operand = integer_expr ^ float_expr ^ variable_name ^ Keyword("pi") ^ Keyword("e")
By using '^', all of the alternatives get evaluated, and then the longest match is chosen. Sometimes that is necessary: when parsing "3.1416", the leading "3" would match integer_expr, but the longer "3.1416" would match float_expr. But here we also try to match a variable name and the keywords "pi" and "e", which are guaranteed non-matches. (There is the same expression-masking problem between variable_name and "pi" and "e", since both keywords would presumably also match as variable names.) If we change to '|', the parser short-circuits on the first match. We just have to take some care in the order things get parsed:
arith_operand = float_expr | integer_expr | Keyword("pi") | Keyword("e") | variable_name
That is, we have to be sure we check for a match on a float before matching an integer, since the leading part of a float could be mistaken for an integer.
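As a runnable sketch (the subexpression definitions here are my stand-ins for the names used above):

from pyparsing import Keyword, Word, alphas, alphanums, pyparsing_common as ppc

# hypothetical definitions for the fragments above
float_expr = ppc.real
integer_expr = ppc.signed_integer
variable_name = Word(alphas, alphanums + "_")

# MatchFirst stops at the first alternative that matches,
# so float_expr must be tried before integer_expr
arith_operand = float_expr | integer_expr | Keyword("pi") | Keyword("e") | variable_name
print(arith_operand.parseString("3.1416"))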
If you have alternatives that are easily mutually exclusive (such as a series of possible keywords, which by virtue of their keyword-ness can never mask each other), then you could also order them by likely frequency. For instance:
bad_expr = Keyword("xylophone") | Keyword("the") | Keyword("a")
is probably going to be a performance loser in most non-musical applications.
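Reordered by (assumed) frequency in ordinary English text, this would be:

better_expr = Keyword("the") | Keyword("a") | Keyword("xylophone")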
In the early days of pyparsing, I did not have the Regex class, so I would have to define an expression for a real number using something like Combine(Word(nums) + "." + Optional(Word(nums))). That is a lot of work for pyparsing, when you can just use Regex(r"\d+\.\d*"). Typically, real-number expressions are low-level expressions used and tested thousands of times in a given parsing run, so converting to a Regex really pays off. Or use one of the ready-made numeric expressions in pyparsing_common, like pyparsing_common.real or pyparsing_common.number. You don't need to go overboard, though - pyparsing internally uses regexes for matching expressions built with Word and oneOf.
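For illustration, all three of these match the same input, but the regex-based forms do it in a single matching step (a sketch using the expressions named above):

from pyparsing import Combine, Optional, Regex, Word, nums, pyparsing_common as ppc

# character-by-character assembly - several parse steps per number
slow_real = Combine(Word(nums) + "." + Optional(Word(nums)))

# a single regex match per number
fast_real = Regex(r"\d+\.\d*")

print(slow_real.parseString("3.1416"))
print(fast_real.parseString("3.1416"))
print(ppc.real.parseString("3.1416"))   # also converts the value to a float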
You can also examine the cache and the cache stats directly, via ParserElement.packrat_cache and ParserElement.packrat_cache_stats. The cache stats are a two-element list: element 0 is the number of cache hits, element 1 the number of misses. You could define a debug action for a recurring expression that prints out the cache stats, so you can compare cache efficiency for different cache sizes. You could also look for duplicated expressions by using Counter: Counter(str(key[0]) for key in ParserElement.packrat_cache). Str-ing the expressions helps identify the duplicates.
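For example, a debug action along these lines could print the running totals (some_recurring_expr is a placeholder for one of your own expressions):

from pyparsing import ParserElement

def show_cache_stats(instring, loc, expr):
    # packrat_cache_stats is [hits, misses]
    hits, misses = ParserElement.packrat_cache_stats
    print("at loc", loc, "- hits:", hits, "misses:", misses)

# installed as the match-start debug action; passing None for the other two
# falls back to pyparsing's default success/exception debug printouts
some_recurring_expr.setDebugActions(show_cache_stats, None, None)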
EDIT: my mistake, iterating over the cache keys in ParserElement.packrat_cache as shown above does not work; the cache OrderedDict itself is hidden away from external access.
Upvotes: 2